# Knowledge Distillation

## Definition
Knowledge distillation trains a small student model to mimic a large teacher model. For eKYC, this compresses server-grade models for mobile and edge deployment while retaining most of the teacher's accuracy.
## Distillation Pipeline
```mermaid
graph LR
    A[Large Teacher Model<br/>ResNet-100, 65M params] --> B[Soft Labels<br/>Teacher's output distributions]
    B --> C[Train Student Model<br/>MobileNet, 5M params]
    D[Hard Labels<br/>Ground truth] --> C
    C --> E[Compact Student<br/>Near-teacher accuracy at 1/10 the size]
    style E fill:#2E7D32,color:#fff
```
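
The two signals in the diagram are typically combined in a single weighted loss: a KL-divergence term between the temperature-softened teacher and student distributions (soft labels) plus a cross-entropy term against the ground truth (hard labels). Below is a minimal PyTorch sketch; the temperature and alpha values are illustrative assumptions, not settings taken from any particular eKYC model.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, hard_labels,
                      temperature=4.0, alpha=0.7):
    """Weighted sum of soft-label (teacher) and hard-label (ground-truth) losses.

    temperature and alpha are illustrative hyperparameters; tune per task.
    """
    # Soften teacher and student distributions with the same temperature.
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)

    # KL divergence between softened distributions, scaled by T^2 so its
    # gradient magnitude stays comparable to the hard-label loss.
    soft_loss = F.kl_div(log_student, soft_targets,
                         reduction="batchmean") * temperature ** 2

    # Standard cross-entropy against the ground-truth labels.
    hard_loss = F.cross_entropy(student_logits, hard_labels)

    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In a training loop, the teacher stays frozen (eval mode, under `torch.no_grad()`) and only supplies `teacher_logits` for each batch; only the student's parameters are updated.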
## Distillation for eKYC
| Teacher | Student | Task | Accuracy Retention |
|---|---|---|---|
| IResNet-100 (65M) | MobileFaceNet (1M) | Face recognition | 97-99% of teacher |
| ResNet-50 (25M) | CDCN-Lite (1M) | Face liveness | 95-98% of teacher |
| EfficientNet-B4 (20M) | MobileNetV3 (5M) | Document classification | 97-99% of teacher |
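
For the face recognition and liveness rows, the student is usually an embedding model rather than a classifier, so distillation is often applied to the embeddings themselves instead of class logits. The sketch below shows one such variant, assuming both teacher and student emit fixed-size face embeddings (e.g. 512-d); the cosine objective is an illustrative choice, not a claim about any specific pairing in the table.

```python
import torch
import torch.nn.functional as F

def embedding_distillation_loss(student_emb, teacher_emb):
    """Pull student face embeddings toward the teacher's embedding space.

    Assumes both models output same-dimension embeddings; cosine similarity
    is one common objective for recognition-model distillation.
    """
    s = F.normalize(student_emb, dim=-1)
    t = F.normalize(teacher_emb, dim=-1)
    # Maximise per-sample cosine similarity (minimise 1 - cos).
    return (1.0 - (s * t).sum(dim=-1)).mean()
```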
## Key Takeaways
- Distillation enables mobile-deployable models with near-server-class accuracy
- Typical compression: 10-60x fewer parameters with 95-99% accuracy retention
- Soft-label learning (matching the teacher's probability distribution) provides a richer training signal than hard labels alone
- Essential for on-device eKYC SDK models