
Data Pipeline Architecture

Definition

How eKYC data flows from capture through storage to analysis: ingestion, processing, enrichment, storage, analytics, and eventual deletion under the retention policy.


Pipeline Flow

graph LR
    A[SDK Upload<br/>Encrypted images] --> B[Ingestion<br/>Validate, decompress]
    B --> C[Processing<br/>Face + Document + Screening]
    C --> D[Enrichment<br/>Risk scoring, dedup]
    D --> E[Storage<br/>Encrypted S3 + metadata DB]
    E --> F[Analytics<br/>Dashboards, reporting]
    E --> G[Retention Manager<br/>Auto-delete after 5 years]
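
The first stages of the flow above can be sketched as a chain of small functions. This is a minimal illustration, not the actual implementation: the `Record` type, the function names, the zlib compression format, the placeholder check results, and the toy risk formula are all assumptions made for the example. Decryption is assumed to happen upstream, and exact-hash matching stands in for whatever deduplication the real pipeline uses.

```python
import hashlib
import zlib
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical in-flight record; field names are illustrative only."""
    session_id: str
    payload: bytes                      # image bytes, assumed decrypted upstream
    checks: dict = field(default_factory=dict)
    risk_score: float = 0.0
    duplicate: bool = False


def ingest(session_id: str, compressed: bytes) -> Record:
    """Ingestion: decompress the upload and reject empty payloads."""
    payload = zlib.decompress(compressed)
    if not payload:
        raise ValueError("empty upload")
    return Record(session_id=session_id, payload=payload)


def process(record: Record) -> Record:
    """Processing: placeholder results for face, document, and screening checks."""
    record.checks = {"face": True, "document": True, "screening": True}
    return record


def enrich(record: Record, seen_hashes: set) -> Record:
    """Enrichment: exact-duplicate detection plus a toy risk score."""
    digest = hashlib.sha256(record.payload).hexdigest()
    record.duplicate = digest in seen_hashes
    seen_hashes.add(digest)
    failed = sum(1 for ok in record.checks.values() if not ok)
    # Assumed scoring: share of failed checks, plus a penalty for duplicates.
    record.risk_score = failed / len(record.checks) + (0.5 if record.duplicate else 0.0)
    return record


def run_pipeline(session_id: str, compressed: bytes, seen_hashes: set) -> Record:
    """Chain ingestion, processing, and enrichment for one upload."""
    return enrich(process(ingest(session_id, compressed)), seen_hashes)
```

Calling `run_pipeline` twice with the same payload and the same `seen_hashes` set flags the second record as a duplicate. Keeping each stage a plain function makes it easy to test in isolation and to replace the placeholders with real services.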

Storage Tiers

| Tier | Data | Storage | Retention |
|------|------|---------|-----------|
| Hot | Active sessions, recent results | SSD, in-memory cache | Days to weeks |
| Warm | Completed verifications | Standard S3, PostgreSQL | 1-5 years |
| Cold | Archived compliance records | Glacier/Archive | 5+ years |
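
As a rough sketch of how a record could be routed between these tiers, the function below picks a tier from the record's age. The age-based rule, the 30-day hot window (one reading of "days to weeks"), and the names `HOT_WINDOW`, `WARM_WINDOW`, and `storage_tier` are assumptions for illustration; a real system might also key on session state or legal holds.

```python
from datetime import datetime, timedelta, timezone

# Assumed thresholds; the tier table gives ranges, not exact cut-offs.
HOT_WINDOW = timedelta(days=30)          # "days to weeks"
WARM_WINDOW = timedelta(days=5 * 365)    # approximates the 5-year boundary


def storage_tier(completed_at: datetime, now: datetime) -> str:
    """Return 'hot', 'warm', or 'cold' based on how old the record is."""
    age = now - completed_at
    if age < HOT_WINDOW:
        return "hot"
    if age < WARM_WINDOW:
        return "warm"
    return "cold"


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(storage_tier(now - timedelta(days=2), now))
```

A scheduled job could call such a function for each record and move objects whose computed tier differs from their current one.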

Key Takeaways

Summary

  • The pipeline must encrypt data end to end, because the captured images are biometric PII
  • Tiered storage optimizes cost: hot (active sessions) → warm (completed verifications) → cold (archived compliance records)
  • Retention automation is required: GDPR and DPDP mandate deletion once the purpose of processing is fulfilled
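
The retention rule can be sketched as a date calculation plus a filter. This is a minimal example under stated assumptions: the 5-year default mirrors the "Auto-delete after 5 years" node in the pipeline diagram, while the function names and the mapping of record IDs to completion dates are hypothetical. Records subject to longer compliance holds would need a different period.

```python
from datetime import date


def deletion_date(completed_on: date, retention_years: int = 5) -> date:
    """Date on which a record becomes eligible for deletion."""
    target_year = completed_on.year + retention_years
    try:
        return completed_on.replace(year=target_year)
    except ValueError:
        # Feb 29 with a non-leap target year: fall back to Feb 28.
        return completed_on.replace(year=target_year, day=28)


def due_for_deletion(records: dict, today: date, retention_years: int = 5) -> list:
    """Return the sorted IDs of records whose retention period has elapsed.

    `records` maps a record ID to its completion date (hypothetical shape).
    """
    return sorted(
        record_id
        for record_id, completed_on in records.items()
        if deletion_date(completed_on, retention_years) <= today
    )
```

A retention manager could run `due_for_deletion` on a schedule and pass the resulting IDs to whatever component deletes the stored objects and their metadata.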