Text Recognition Models

Definition

Text recognition (Stage 2 of OCR) reads the text content from detected regions, converting image patches containing text into character strings.


Key Models

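| Model | Architecture | Key Feature | Accuracy (IC13) |
|---|---|---|---|
| CRNN | CNN + BiLSTM + CTC | Classic, fast, reliable baseline | 92-95% |
| TrOCR | ViT-style image encoder + autoregressive Transformer text decoder | Pretrained encoder and decoder, highest accuracy | 97-99% |
| PaddleOCR PP-OCRv4 | Lightweight CNN + SVTR | Best practical system, mobile-ready | 96-98% |
| SVTR | Single visual model built from transformer blocks | No RNN, attention-based | 96-98% |
| ABINet | Vision model + autonomous, bidirectional language model with iterative correction | Built-in language model correction | 97-98% |
| PARSeq | Permuted autoregressive sequence transformer | Trained over multiple decoding orders (permutation language modeling) | 97-99% |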

CTC vs Attention Decoding

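| Approach | How It Works | Pros | Cons |
|---|---|---|---|
| CTC | Aligns per-frame outputs to the target sequence using a blank symbol | Fast, simple, no fixed output length | Treats frames as conditionally independent, so character dependencies are not modeled |
| Attention | Decoder attends to the relevant input parts for each character | Models dependencies, higher accuracy | Slower, attention drift on long text |
| Hybrid | CTC loss + attention decoder | Best of both | More complex training |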
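
The CTC row can be made concrete with a minimal greedy (best-path) decoding sketch. It assumes the recognizer emits one score vector per horizontal frame and that class index 0 is the blank symbol; both the index convention and the function name `ctc_greedy_decode` are illustrative, not taken from any particular library.

```python
from itertools import groupby

BLANK = 0  # assumed index of the CTC blank symbol


def ctc_greedy_decode(frame_scores, alphabet, blank=BLANK):
    """Collapse per-frame class scores into a string (best-path decoding).

    frame_scores: list of score lists, one list per frame, one score per class.
    alphabet: string where alphabet[i - 1] is the character for class i.
    """
    # 1. Pick the highest-scoring class independently for every frame.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    # 2. Merge consecutive repeats of the same class.
    collapsed = [cls for cls, _ in groupby(best_path)]
    # 3. Drop blanks and map the remaining classes to characters.
    return "".join(alphabet[cls - 1] for cls in collapsed if cls != blank)


def one_hot(index, num_classes=3):
    """Helper for the demo: a score vector that peaks at `index`."""
    return [1.0 if j == index else 0.0 for j in range(num_classes)]


# Classes: 0 = blank, 1 = "a", 2 = "b".
# The blank between the two runs of class 1 keeps the repeated "a".
demo_frames = [one_hot(i) for i in [1, 1, 0, 1, 2, 2]]
decoded = ctc_greedy_decode(demo_frames, "ab")
print(decoded)
```

Because step 1 chooses each frame on its own, nothing in the decoder can use the previous character to disambiguate the next one, which is the dependency limitation listed in the table.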

ID Document-Specific Considerations

| Content | Best Approach | Notes |
|---|---|---|
| Printed Latin | Any model (well-solved) | 98%+ accuracy |
| MRZ (OCR-B) | Specialized MRZ recognizer | 99.5%+ with checksum validation |
| Devanagari | PaddleOCR / Tesseract with Indic models | 92-96% accuracy |
| Arabic | Right-to-left models, PaddleOCR | 90-95% accuracy |
| Chinese | Large character set models | 95-98% accuracy (3000+ characters) |
| Handwritten | HTR-specific models (TrOCR fine-tuned) | 70-90% depending on quality |
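
The checksum validation mentioned for MRZ can be sketched as follows. This assumes the ICAO Doc 9303 check-digit scheme (repeating weights 7, 3, 1; digits keep their value, A-Z map to 10-35, the filler `<` counts as 0). The function names are illustrative.

```python
DIGITS = "0123456789"


def mrz_char_value(ch):
    """Numeric value of one MRZ character under the 7-3-1 scheme."""
    if ch in DIGITS:
        return int(ch)
    if "A" <= ch <= "Z":
        return ord(ch) - ord("A") + 10
    if ch == "<":  # filler character
        return 0
    raise ValueError(f"invalid MRZ character: {ch!r}")


def mrz_check_digit(field):
    """Weighted sum of character values modulo 10."""
    weights = (7, 3, 1)
    total = sum(mrz_char_value(ch) * weights[i % 3] for i, ch in enumerate(field))
    return total % 10


def mrz_field_is_valid(field, check_char):
    """True if the recognized check character matches the computed digit."""
    return check_char in DIGITS and int(check_char) == mrz_check_digit(field)


# Values from the commonly used ICAO specimen passport.
print(mrz_check_digit("L898902C3"))       # document number
print(mrz_field_is_valid("740812", "2"))  # date of birth plus its check digit
```

A failed check flags a recognition error without saying which character is wrong, so a recognizer would typically retry the field or try common confusions (for example `O` vs `0`) until the check passes.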

Key Takeaways

  • TrOCR is among the most accurate recognizers (97-99% on IC13, alongside PARSeq); PaddleOCR is the best practical system (fast, multilingual)
  • CRNN + CTC remains a solid baseline for production systems
  • Recognition accuracy: printed (98%+), MRZ (99.5%+), handwritten (70-90%)
  • For ID documents, each field has a known type (date, number, name), so its constrained vocabulary allows post-processing to correct recognition errors
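
The last takeaway can be illustrated with a small post-processing sketch. It assumes each field arrives tagged with a type and that alphabetic fields are upper-case, as on many ID documents. The confusion tables and the `correct_field` name are hypothetical examples, not a standard mapping.

```python
# Hypothetical confusion tables: visually similar characters that a
# recognizer may swap. Real systems would tune these per font and model.
TO_DIGIT = str.maketrans(
    {"O": "0", "Q": "0", "D": "0", "I": "1", "L": "1", "Z": "2", "S": "5", "B": "8"}
)
TO_LETTER = str.maketrans({"0": "O", "1": "I", "2": "Z", "5": "S", "8": "B"})


def correct_field(text, field_type):
    """Map characters that are impossible for the field type to look-alikes.

    field_type: "numeric" (dates, document numbers made of digits) or
    "alpha" (upper-case names). Any other type is returned unchanged.
    """
    if field_type == "numeric":
        return text.upper().translate(TO_DIGIT)
    if field_type == "alpha":
        return text.upper().translate(TO_LETTER)
    return text


print(correct_field("2O24-O1-I5", "numeric"))
print(correct_field("SM1TH", "alpha"))
```

Separators such as `-` pass through untouched because they are absent from both tables. This kind of correction is only safe when the field type is known with certainty; applied to a mixed alphanumeric field it would corrupt valid characters.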