Face Recognition Architectures
Definition
This article covers the specific deep learning architectures and loss functions used in modern face recognition systems: the models that generate face embeddings for identity verification.
Loss Function Evolution
```mermaid
graph LR
    A["Softmax<br/>(2014)"] --> B["Center Loss<br/>(2016)"]
    B --> C["SphereFace<br/>(2017)"]
    C --> D["CosFace<br/>(2018)"]
    D --> E["ArcFace<br/>(2019)"]
    E --> F["AdaFace<br/>(2022)"]
    E --> G["ElasticFace<br/>(2022)"]
    style E fill:#4051B5,color:#fff
    style F fill:#2E7D32,color:#fff
```
ArcFace (Additive Angular Margin)
| Aspect | Details |
| --- | --- |
| Core idea | Add an angular margin penalty m to the angle between the feature and its class center |
| Formula | L = -log(exp(s·cos(θ_y + m)) / (exp(s·cos(θ_y + m)) + Σ_{j≠y} exp(s·cos(θ_j)))) |
| Default params | s = 64, m = 0.5 |
| Benefit | Creates a clear geometric boundary between classes on the hypersphere |
| Weakness | Fixed margin, so easy and hard samples are penalized equally |
| Usage | Most widely used in production eKYC systems |
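The margin-adjusted logit computation can be sketched in a few lines of NumPy. This is a simplified batch-level illustration, not production training code; `arcface_logits` and `arcface_loss` are illustrative names:

```python
import numpy as np

def arcface_logits(embeddings, centers, labels, s=64.0, m=0.5):
    """ArcFace logit adjustment: add angular margin m to the target class.

    embeddings: (N, d) L2-normalized features; centers: (C, d) L2-normalized
    class-center weights; labels: (N,) ground-truth class indices.
    """
    cos_theta = np.clip(embeddings @ centers.T, -1.0, 1.0)   # (N, C)
    theta = np.arccos(cos_theta)
    margin = np.zeros_like(theta)
    margin[np.arange(len(labels)), labels] = m               # penalize the target angle only
    return s * np.cos(theta + margin)                        # scaled logits

def arcface_loss(logits, labels):
    """Cross-entropy over the margin-adjusted, scaled logits."""
    logits = logits - logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(labels)), labels].mean()
```

Because cos(θ + m) ≤ cos(θ) for θ in [0, π − m], the margin shrinks the target logit, forcing the network to pull features closer to their class center than plain softmax would.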
AdaFace (Quality-Adaptive)
| Aspect | Details |
| --- | --- |
| Core idea | Adapt the margin to image quality: harder margin for high-quality samples, softer for low-quality |
| Quality proxy | Feature norm as the quality indicator |
| Benefit | Handles the quality mismatch (ID photo vs. selfie) that ArcFace struggles with |
| eKYC relevance | Directly addresses the core eKYC challenge of cross-quality matching |
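A simplified sketch of the norm-based margin adaptation, following the paper's g_angle = −m·ẑ, g_add = m·ẑ + m form with the published default constants; the batch-statistics handling is reduced to its essence here:

```python
import numpy as np

def adaface_margins(feature_norms, m=0.4, h=0.333, eps=1e-3):
    """Map per-sample feature norms (the quality proxy) to margin components.

    Each norm is standardized against batch statistics and clipped to [-1, 1];
    the resulting z shifts each sample between an angular-margin regime and
    an additive-margin regime.
    """
    mu, sigma = feature_norms.mean(), feature_norms.std() + eps
    z = np.clip((feature_norms - mu) / (sigma / h), -1.0, 1.0)
    g_angle = -m * z          # angular component, varies with quality
    g_add = m * z + m         # additive component, varies with quality
    return g_angle, g_add     # plug into s * (cos(theta + g_angle) - g_add)
```

The per-sample margins replace the fixed ArcFace margin, so low-norm (low-quality) samples are penalized differently from high-norm ones rather than all sharing one boundary.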
ElasticFace
| Aspect | Details |
| --- | --- |
| Core idea | Sample the margin randomly from a distribution, giving elastic class boundaries |
| Benefit | More flexible decision boundaries, better generalization |
| Training | Margin drawn from a Gaussian or uniform distribution at each iteration |
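The elastic margin itself is just a per-sample draw each iteration; a minimal sketch of the Gaussian variant (function name and sigma value are illustrative):

```python
import numpy as np

def elastic_margins(batch_size, m=0.5, sigma=0.05, rng=None):
    """Draw one angular margin per sample from N(m, sigma^2); each m_i then
    replaces the fixed m in an ArcFace-style cos(theta + m_i) term."""
    rng = rng if rng is not None else np.random.default_rng()
    return rng.normal(loc=m, scale=sigma, size=batch_size)
```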
Backbone Architectures
Standard Server Backbones
| Backbone | Params | GFLOPs | LFW (%) | AgeDB-30 (%) | CFP-FP (%) | Best For |
| --- | --- | --- | --- | --- | --- | --- |
| IResNet-100 | 65M | 24.2 | 99.83 | 98.35 | 99.07 | Production server |
| IResNet-50 | 44M | 12.3 | 99.80 | 97.95 | 98.62 | Balanced server |
| IResNet-34 | 33M | 7.4 | 99.78 | 97.60 | 98.20 | Efficient server |
Mobile/Edge Backbones
| Backbone | Params | GFLOPs | LFW (%) | Speed (mobile) | Best For |
| --- | --- | --- | --- | --- | --- |
| MobileNetV3-Large | 5.4M | 0.45 | 99.50 | 15-30 ms | Standard mobile |
| MobileFaceNet | 0.99M | 0.22 | 99.55 | 5-15 ms | Ultra-lightweight |
| EfficientNet-B0 | 5.3M | 0.39 | 99.55 | 20-40 ms | Balanced mobile |
| GhostNet | 5.2M | 0.15 | 99.45 | 5-10 ms | Fastest mobile |
| EdgeNeXt-S | 5.6M | 1.0 | 99.60 | 15-25 ms | Modern efficient |
Transformer Backbones
| Backbone | Params | GFLOPs | LFW (%) | Notes |
| --- | --- | --- | --- | --- |
| ViT-Small | 22M | 4.6 | 99.80 | Competitive with CNNs |
| ViT-Base | 86M | 17.6 | 99.83 | Matches IResNet-100 |
| DeiT-Small | 22M | 4.6 | 99.78 | Distilled, efficient |
Training Pipeline
```mermaid
graph TD
    A[Training Data<br/>MS1MV2 / Glint360K] --> B[Data Augmentation<br/>Flip, crop, color jitter]
    B --> C[Backbone<br/>IResNet-100]
    C --> D[BN + Dropout]
    D --> E[FC Layer → 512-d embedding]
    E --> F[ArcFace Head<br/>s=64, m=0.5]
    F --> G[Cross-Entropy Loss]
    G --> H[SGD / AdamW Optimizer]
    H --> I[Learning Rate Schedule<br/>Cosine annealing / step decay]
    style F fill:#4051B5,color:#fff
```
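At inference the ArcFace head is discarded; identity verification reduces to cosine similarity between the 512-d embeddings. A minimal sketch (the 0.3 threshold is illustrative; real systems calibrate it against a target false accept rate):

```python
import numpy as np

def cosine_verify(emb_a, emb_b, threshold=0.3):
    """Verification decision: L2-normalize both embeddings and accept
    the pair if their cosine similarity clears the threshold."""
    a = emb_a / np.linalg.norm(emb_a)
    b = emb_b / np.linalg.norm(emb_b)
    sim = float(a @ b)
    return sim, sim >= threshold
```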
Training Hyperparameters (Typical)
| Parameter | Value |
| --- | --- |
| Batch size | 512 (distributed across GPUs) |
| Optimizer | SGD (momentum=0.9, weight_decay=5e-4) |
| Learning rate | 0.1, decayed at epochs 20, 28, 32 |
| Total epochs | 34-40 |
| Embedding dim | 512 |
| Input size | 112 × 112 |
| Augmentation | Horizontal flip, random erasing |
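The step-decay schedule in the table can be written directly. The 10× decay factor (`gamma=0.1`) is an assumption, common for this recipe but not fixed by the table, which only gives the decay epochs:

```python
def step_lr(epoch, base_lr=0.1, decay_epochs=(20, 28, 32), gamma=0.1):
    """Step decay: multiply the base learning rate by gamma once for
    each decay epoch the training has already reached."""
    return base_lr * gamma ** sum(epoch >= e for e in decay_epochs)
```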
Benchmarks
Verification Benchmarks
| Benchmark | What It Tests | Top Performance |
| --- | --- | --- |
| LFW | Labeled Faces in the Wild, general FR | 99.83% (near-saturated) |
| AgeDB-30 | Cross-age (30-year gap) | 98.35% |
| CFP-FP | Cross-pose (frontal vs. profile) | 99.07% |
| CALFW | Cross-age with LFW pairs | 96.20% |
| CPLFW | Cross-pose with LFW pairs | 93.37% |
| IJB-C | Unconstrained verification (NIST) | TAR = 97% @ FAR = 1e-4 |
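The IJB-C metric can be computed from raw score lists: fix the threshold so that the impostor acceptance rate equals the target FAR, then measure genuine acceptance. A minimal sketch (function name illustrative, demonstrated on synthetic scores):

```python
import numpy as np

def tar_at_far(genuine_scores, impostor_scores, far=1e-4):
    """TAR at a fixed FAR: set the threshold at the (1 - far) quantile of
    the impostor scores, then count genuine pairs that clear it."""
    thr = np.quantile(impostor_scores, 1.0 - far)
    tar = float((genuine_scores >= thr).mean())
    return tar, float(thr)
```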
Key Takeaways
Summary
- ArcFace remains the standard: its angular margin loss on the hypersphere creates discriminative embeddings
- AdaFace is specifically relevant for eKYC: it handles the quality mismatch between ID photos and selfies
- IResNet-100 is the standard server backbone; MobileFaceNet for mobile
- Vision Transformers are competitive, but CNNs still dominate in production due to speed
- Training requires large-scale datasets (5M+ images) with clean labels
- LFW is saturated (99.8%+), so real evaluation requires cross-age, cross-pose, and cross-quality benchmarks
Related Articles