Skip to content

OCR Pipeline for ID Documents

Definition

The OCR (Optical Character Recognition) pipeline for identity documents transforms a captured document image into structured data — extracting fields like name, date of birth, document number, address, and expiry date with high accuracy.


The Three-Stage Pipeline

graph LR
    A[Document Image] --> B["Stage 1: Text Detection<br/>Where is text?"]
    B --> C["Stage 2: Text Recognition<br/>What does it say?"]
    C --> D["Stage 3: Field Mapping<br/>Which field is which?"]
    D --> E[Structured Output<br/>JSON with fields]

    style B fill:#1565C0,color:#fff
    style C fill:#6A1B9A,color:#fff
    style D fill:#2E7D32,color:#fff

Stage 1: Text Detection

Locating text regions in the document image:

Model Architecture Key Feature Speed
CRAFT VGG-16 + affinity Character-level detection, handles curved text 30-50ms
EAST PVANet + geometry Fast, compact — good for real-time 10-20ms
DBNet ResNet + differentiable binarization Adaptive thresholding, state-of-the-art 20-40ms
DBNet++ DBNet + adaptive scale fusion Improved multi-scale detection 25-45ms
PSENet Progressive scale expansion Handles closely spaced text well 30-50ms

Detection Output

For each text region: polygon coordinates (4+ points) enclosing the text.


Stage 2: Text Recognition

Reading the detected text regions:

Model Architecture Key Feature Accuracy
CRNN CNN + BiLSTM + CTC Classic, fast, reliable 95-98% (printed)
TrOCR ViT encoder + GPT decoder Transformer-based, high accuracy 98-99% (printed)
PaddleOCR PP-OCRv4 Lightweight CNN + SVTR Fast, multilingual, mobile-ready 97-99%
SVTR Scene text ViT Single visual model, no RNN 97-99%
ABINet Autonomous, Bidirectional, Iterative Language model correction built-in 98-99%

Recognition Accuracy by Content Type

Content Type Typical Accuracy Challenge
Printed Latin text 98-99.5% Standard, well-solved
Printed non-Latin 95-99% Script-dependent (Arabic harder than Chinese)
Handwritten text 70-90% Highly variable, personal style
Numbers/dates 99%+ Constrained vocabulary helps
MRZ (OCR-B font) 99.5%+ Fixed font designed for OCR
Damaged/faded text 60-85% Enhancement helps but limits exist

Stage 3: Field Mapping

Assigning recognized text to semantic fields:

Approach 1: Template-Based

graph TD
    A[Classified document type<br/>e.g., India Aadhaar PVC] --> B[Load template<br/>Known field positions]
    B --> C[Map detected text regions<br/>to template fields based on position]
    C --> D[Structured output<br/>name, DOB, Aadhaar number, address]
  • Pros: Fast, reliable for known templates
  • Cons: Breaks if document layout varies, requires template per document variant

Approach 2: Document Understanding Models

Model Architecture Key Innovation
LayoutLMv3 Multimodal Transformer (text + layout + image) Pre-trained on document understanding
LiLT Language-Independent Layout Transformer Layout knowledge transfers across languages
Donut End-to-end (no separate OCR needed) Image → JSON directly
DocFormer Multi-modal transformer Combines text, visual, and spatial features
UDOP Unified Document Processing Single model for all document tasks

LayoutLMv3 is the most widely used:

Input: Document image + OCR text + bounding box positions
→ Multimodal transformer processes all modalities jointly
→ Output: Field labels for each text region (name, DOB, id_number, etc.)

Approach 3: Hybrid

Template-based for known high-volume documents (Aadhaar, passport) + ML model for long-tail/unknown documents.


Post-Processing & Validation

Step What It Does Example
Date normalization Convert various date formats to ISO 8601 "12/03/1990" → "1990-03-12"
Name cleaning Remove artifacts, fix spacing "J O H N DOE" → "JOHN DOE"
Number validation Check digit validation for ID numbers Aadhaar: Verhoeff checksum
Cross-field validation DOB on front matches MRZ DOB Catch OCR errors
MRZ validation Check digits in MRZ (ICAO 9303) Mathematically verify MRZ integrity
Confidence scoring Per-field confidence based on recognition score Flag low-confidence fields for review

End-to-End Performance

Metric Target Typical Achievement
Field-level accuracy > 95% 96-99% (printed modern docs)
Document-level accuracy > 90% (all fields correct) 90-95%
Processing time < 3 seconds 1-3 sec (GPU server)
First-attempt success > 85% 80-90%

Key Takeaways

Summary

  • OCR pipeline has 3 stages: text detection (CRAFT/DBNet) → recognition (CRNN/TrOCR) → field mapping (LayoutLMv3/template)
  • LayoutLMv3 is the state-of-the-art for field extraction — combines text, layout, and image understanding
  • Printed text: 98-99% accuracy; handwritten: 70-90%; damaged: 60-85%
  • Post-processing validation (checksums, cross-field checks) catches remaining OCR errors
  • Hybrid approach works best: templates for known documents + ML for unknown/edge cases