eKYC Encyclopedia

Data Pipeline Architecture

Definition

The data pipeline architecture describes how eKYC data flows from capture through storage to analysis. It covers ingestion, processing, enrichment, storage, and eventual deletion under the applicable retention policies.

Pipeline Flow

graph LR
    A[SDK Upload<br/>Encrypted images] --> B[Ingestion<br/>Validate, decompress]
    B --> C[Processing<br/>Face + Document + Screening]
    C --> D[Enrichment<br/>Risk scoring, dedup]
    D --> E[Storage<br/>Encrypted S3 + metadata DB]
    E --> F[Analytics<br/>Dashboards, reporting]
    E --> G[Retention Manager<br/>Auto-delete after 5 years]
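The first four stages of the graph above (ingestion, processing, enrichment, storage) can be sketched as a chain of functions that each take and return a record. This is a minimal illustration only: the `Record` class, the stage function names, and the placeholder check results are assumptions made for the example, not part of any real eKYC SDK. Decryption and the actual face, document, and screening logic are left out.

```python
import zlib
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Record:
    """Hypothetical container passed from stage to stage."""
    session_id: str
    payload: bytes
    metadata: Dict[str, object] = field(default_factory=dict)
    stages: List[str] = field(default_factory=list)


def ingest(rec: Record) -> Record:
    # Validate the upload, then decompress it (decryption is out of scope here).
    if not rec.payload:
        raise ValueError("empty upload")
    rec.payload = zlib.decompress(rec.payload)
    rec.stages.append("ingestion")
    return rec


def process(rec: Record) -> Record:
    # Placeholder for the face, document, and screening checks.
    rec.metadata["checks"] = {
        "face": "pending",
        "document": "pending",
        "screening": "pending",
    }
    rec.stages.append("processing")
    return rec


def enrich(rec: Record) -> Record:
    # Placeholder: a real system would compute a risk score and run dedup.
    rec.metadata["risk_score"] = None
    rec.stages.append("enrichment")
    return rec


def store(rec: Record) -> Record:
    # Placeholder for writing the encrypted blob and its metadata row.
    rec.stages.append("storage")
    return rec


# Stage order mirrors the graph: ingestion -> processing -> enrichment -> storage.
PIPELINE: List[Callable[[Record], Record]] = [ingest, process, enrich, store]


def run_pipeline(rec: Record) -> Record:
    for stage in PIPELINE:
        rec = stage(rec)
    return rec


if __name__ == "__main__":
    result = run_pipeline(Record("s-1", zlib.compress(b"image-bytes")))
    print(result.stages)
```

Keeping every stage to the same signature makes it straightforward to insert, reorder, or test stages in isolation; analytics and the retention manager would read from storage rather than sit in this chain.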

Storage Tiers

Tier	Data	Storage	Retention
Hot	Active sessions, recent results	SSD, in-memory cache	Days to weeks
Warm	Completed verifications	Standard S3, PostgreSQL	1-5 years
Cold	Archived compliance records	Glacier/Archive	5+ years
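The tiering in the table can be expressed as a simple age-based rule. The sketch below assumes concrete cut-offs (30 days for hot, 5 years for warm) because the table only gives ranges; `select_tier`, `HOT_WINDOW`, and `WARM_WINDOW` are hypothetical names chosen for the example.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum


class Tier(Enum):
    HOT = "hot"
    WARM = "warm"
    COLD = "cold"


# Assumed cut-offs; the table specifies only "days to weeks" and "1-5 years".
HOT_WINDOW = timedelta(days=30)
WARM_WINDOW = timedelta(days=5 * 365)  # approximates 5 years, ignoring leap days


def select_tier(completed_at: datetime, now: datetime) -> Tier:
    """Pick a storage tier from the age of a verification record."""
    age = now - completed_at
    if age < HOT_WINDOW:
        return Tier.HOT
    if age < WARM_WINDOW:
        return Tier.WARM
    return Tier.COLD


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(select_tier(now - timedelta(days=400), now))
```

In practice this kind of rule is often delegated to the storage layer's own lifecycle configuration rather than application code; the function only shows the decision being made.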

Key Takeaways

Summary

The data pipeline must encrypt data end to end, because the captured images are biometric PII
Tiered storage optimizes cost: hot (active sessions) → warm (completed verifications) → cold (archived compliance records)
Retention automation is required: GDPR and India's DPDP Act require deletion once the purpose of processing has been fulfilled
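The retention point can be illustrated with a small sweep that lists records past their retention period. The 5-year figure comes from the Retention Manager node in the pipeline graph; `StoredRecord`, `due_for_deletion`, and the leap-day-free approximation of a year are assumptions for this sketch, and the actual deletion of blobs and metadata rows is not shown.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Iterable, List

# Approximates the 5-year retention period, ignoring leap days.
RETENTION = timedelta(days=5 * 365)


@dataclass(frozen=True)
class StoredRecord:
    """Hypothetical metadata row for a stored verification."""
    record_id: str
    completed_at: datetime


def due_for_deletion(records: Iterable[StoredRecord], now: datetime) -> List[str]:
    """Return the IDs of records whose retention period has elapsed."""
    return [r.record_id for r in records if now - r.completed_at >= RETENTION]


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    sample = [
        StoredRecord("old", now - timedelta(days=6 * 365)),
        StoredRecord("recent", now - timedelta(days=30)),
    ]
    print(due_for_deletion(sample, now))
```

Passing `now` in explicitly, instead of reading the clock inside the function, keeps the sweep deterministic and easy to test.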