
Document Understanding Models

Definition

Document understanding models go beyond OCR: they jointly process text, visual layout, and image features to extract structured data from documents. Instead of detecting text, reading it, and then mapping fields in separate stages, these models interpret the entire document holistically.


Key Models

| Model | Year | Architecture | Key Innovation |
|---|---|---|---|
| LayoutLM | 2020 | BERT + 2D position embeddings | First to combine text + layout |
| LayoutLMv2 | 2021 | Multimodal (text + layout + image) | Added visual features from the image |
| LayoutLMv3 | 2022 | Unified multimodal pretraining | Unified text-image pretraining, no CNN needed |
| LiLT | 2022 | Language-Independent Layout Transformer | Layout knowledge transfers across languages |
| Donut | 2022 | Swin Transformer → BART decoder | End-to-end: image → JSON, no separate OCR |
| DocFormer | 2021 | Multi-modal transformer | Shared positional encoding across modalities |
| UDOP | 2023 | Unified Document Processing | Single model for all document tasks |

LayoutLMv3 (Most Used in eKYC)

```mermaid
graph TD
    A[Document Image] --> B[Patch Embeddings<br/>Visual features from image patches]
    C[OCR Text] --> D[Word Embeddings<br/>Token-level text features]
    E[Bounding Boxes] --> F[Layout Embeddings<br/>2D position features]

    B & D & F --> G[Multimodal Transformer<br/>Cross-attention across modalities]
    G --> H[Field Classification Head<br/>name / DOB / id_number / address / ...]

    style G fill:#4051B5,color:#fff
```
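A practical detail behind the layout-embedding branch above: LayoutLM-family models expect word bounding boxes normalized to a 0-1000 integer grid, independent of the image's pixel dimensions. A minimal sketch of that normalization (the helper name `normalize_bbox` is hypothetical; the 0-1000 convention is the LayoutLM family's):

```python
def normalize_bbox(bbox, width, height):
    """Scale a pixel-space box (x0, y0, x1, y1) to the 0-1000
    integer grid that LayoutLM-family layout embeddings expect."""
    x0, y0, x1, y1 = bbox
    return (
        int(1000 * x0 / width),
        int(1000 * y0 / height),
        int(1000 * x1 / width),
        int(1000 * y1 / height),
    )

# Example: one OCR word box on a 640x480 ID-card crop
print(normalize_bbox((64, 48, 320, 96), 640, 480))  # (100, 100, 500, 200)
```

Every OCR box is passed through this step before being fed to the model alongside its token.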

LayoutLMv3 for eKYC

| Aspect | Details |
|---|---|
| Input | Document image + OCR text + bounding boxes |
| Training | Pre-trained on 11M documents, fine-tuned on ID-specific data |
| Output | Token-level labels: which text belongs to which field |
| Languages | Works across languages (pre-trained on multilingual data) |
| Accuracy | 95%+ field extraction accuracy on standard ID cards |
| Speed | 50-200 ms per document (GPU) |
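Because the model's output is token-level labels, a small post-processing step is needed to merge labeled tokens back into field values. A minimal sketch assuming BIO-style tags (`B-NAME`, `I-NAME`, `O`, ...); the function name and field names are illustrative:

```python
def group_fields(tokens, labels):
    """Merge token-level BIO predictions into one string per field,
    keeping tokens in reading order."""
    fields = {}
    current = None
    for token, label in zip(tokens, labels):
        if label == "O":          # token belongs to no field
            current = None
            continue
        prefix, field = label.split("-", 1)
        if prefix == "B" or field != current:
            fields.setdefault(field, []).append(token)  # start a new span
            current = field
        else:                     # I- continuation of the current span
            fields[field][-1] = fields[field][-1] + " " + token
    # join multiple spans of the same field with a space
    return {f: " ".join(spans) for f, spans in fields.items()}

tokens = ["JANE", "DOE", "1990-01-31", "123", "Main", "St"]
labels = ["B-NAME", "I-NAME", "B-DOB", "B-ADDRESS", "I-ADDRESS", "I-ADDRESS"]
print(group_fields(tokens, labels))
# {'NAME': 'JANE DOE', 'DOB': '1990-01-31', 'ADDRESS': '123 Main St'}
```

In production this step usually also re-sorts tokens by their bounding boxes first, so multi-line fields like addresses are joined in reading order.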

Donut (OCR-Free Alternative)

| Advantage | Disadvantage |
|---|---|
| No separate OCR needed | Requires large training data |
| End-to-end trainable | Slower than LayoutLMv3 + dedicated OCR |
| Naturally handles complex layouts | Less interpretable (no intermediate OCR output) |
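Donut's decoder emits a flat token sequence with XML-like field markers (e.g. `<s_name>...</s_name>`), which the caller converts to structured JSON. A minimal sketch of that conversion for flat (non-nested) outputs; the specific field names are hypothetical and depend on how the model was fine-tuned:

```python
import re

def donut_sequence_to_dict(seq):
    """Parse a flat Donut-style output sequence of XML-like field
    tokens (<s_field>value</s_field>) into a Python dict."""
    return dict(re.findall(r"<s_([a-z_]+)>(.*?)</s_\1>", seq))

seq = ("<s_name>JANE DOE</s_name>"
       "<s_dob>1990-01-31</s_dob>"
       "<s_id_number>X1234567</s_id_number>")
print(donut_sequence_to_dict(seq))
# {'name': 'JANE DOE', 'dob': '1990-01-31', 'id_number': 'X1234567'}
```

Nested structures (repeated line items, grouped sub-fields) need a recursive parser instead of a single regular expression.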

Key Takeaways

Summary

  • LayoutLMv3 is the standard for document field extraction — multimodal (text + layout + image)
  • LiLT is ideal for multilingual eKYC — layout knowledge transfers across languages
  • Donut offers an OCR-free alternative — image directly to structured JSON
  • These models replace template-based extraction for diverse document types
  • Fine-tuning on ID-specific data is essential — general document models need adaptation