## 3.7 Adversarial Machine Learning Attacks

### Overview
Adversarial ML attacks target the machine learning models powering the liveness system. Instead of fooling the camera with physical artifacts, they fool the neural network with carefully crafted input modifications.
### Attack Types

#### White-Box Attacks (Attacker Has Model Access)
```mermaid
graph TD
    A["Attacker obtains<br>model weights"] --> B["Computes gradient<br>of loss w.r.t.<br>input pixels"]
    B --> C["Adds small<br>perturbation<br>to spoof image"]
    C --> D["Perturbed image<br>fools model into<br>predicting 'live'"]
```
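The steps in the diagram can be sketched against a toy differentiable model, a single logistic unit standing in for the liveness network. All names are illustrative; a real attack would obtain the gradient by backpropagating through the full network.

```python
import numpy as np

def fgsm_attack(x, w, b, target=1.0, eps=4/255):
    """One-step FGSM against a toy logistic 'liveness' scorer.

    score = sigmoid(w @ x + b); target=1.0 means "push toward 'live'".
    For binary cross-entropy, dL/dx = (score - target) * w, so a single
    signed step against the gradient moves the score toward the target.
    """
    p = 1.0 / (1.0 + np.exp(-(w @ x + b)))   # current 'live' probability
    grad = (p - target) * w                  # gradient of BCE loss w.r.t. x
    x_adv = x - eps * np.sign(grad)          # signed step toward the target
    return np.clip(x_adv, 0.0, 1.0)          # keep pixels in valid range
```

Each pixel moves by at most ε, so the perturbation stays within the nearly invisible budget listed in the table below, yet every pixel pushes the logit in the attacker's chosen direction.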
| Method | Perturbation Size | Success Rate | Detectability |
|---|---|---|---|
| FGSM (Fast Gradient Sign) | ε = 4/255 | 60-80% | Low perturbation — nearly invisible |
| PGD (Projected Gradient Descent) | ε = 8/255 | 85-95% | Moderate perturbation |
| C&W Attack | Minimal (optimized) | 95-99% | Minimal — optimized for imperceptibility |
| AutoAttack (attack ensemble) | Varies | 90-98% | Varies by component attack |
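PGD from the table is FGSM iterated with a projection step. A minimal sketch against a toy logistic classifier (a single sigmoid unit standing in for the liveness model; names and constants are illustrative):

```python
import numpy as np

def pgd_attack(x, w, b, target=1.0, eps=8/255, alpha=2/255, steps=10):
    """PGD: repeated small signed gradient steps, each followed by
    projection back into the L-infinity ball of radius eps around
    the clean input."""
    x_adv = x.copy()
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(w @ x_adv + b)))  # current 'live' score
        grad = (p - target) * w                     # BCE gradient w.r.t. input
        x_adv = x_adv - alpha * np.sign(grad)       # small signed step
        x_adv = np.clip(x_adv, x - eps, x + eps)    # project into eps-ball
        x_adv = np.clip(x_adv, 0.0, 1.0)            # keep pixels valid
    return x_adv
```

The iteration is why PGD beats one-step FGSM in the table: each step re-evaluates the gradient, letting the attack follow a curved loss surface while the projection keeps the total perturbation bounded.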
#### Black-Box Attacks (No Model Access)
| Method | Approach | Queries Needed | Success Rate |
|---|---|---|---|
| Transfer attack | Craft adversarial examples on a surrogate model; rely on transferability | 0 (crafted offline) | 30-60% |
| Score-based | Use returned scores to estimate gradients | 1,000-10,000 | 50-80% |
| Decision-based | Only uses accept/reject decisions | 10,000-100,000 | 40-70% |
| Query-efficient | Combines transfer + score-based | 100-1,000 | 60-85% |
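The score-based row can be sketched with query-only gradient estimation via antithetic finite differences (NES-style). Here `score_fn` stands in for the liveness API; all names are illustrative:

```python
import numpy as np

def estimate_gradient(score_fn, x, n_queries=200, sigma=1e-3, rng=None):
    """Score-based black-box step: estimate the model's input gradient
    using only returned scores. Each pair of queries probes the score
    along a random direction u and its negation; accumulating
    (difference * u) over many directions approximates the gradient."""
    rng = rng or np.random.default_rng(0)
    g = np.zeros_like(x)
    for _ in range(n_queries // 2):
        u = rng.normal(size=x.shape)
        g += (score_fn(x + sigma * u) - score_fn(x - sigma * u)) * u
    return g / (n_queries * sigma)
```

The estimated gradient then feeds the same signed-step machinery as a white-box attack, which is why the table's query counts, not the crafting algorithm, are the practical bottleneck.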
### Model Extraction
```mermaid
graph TD
    A["Attacker queries<br>liveness API<br>with diverse inputs"] --> B["Collects input-score<br>pairs (thousands)"]
    B --> C["Trains surrogate<br>model to mimic<br>the API"]
    C --> D["Crafts white-box<br>adversarial examples<br>on surrogate"]
    D --> E["Adversarial examples<br>transfer to real model"]
```
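When the API leaks a continuous score, the pipeline above reduces to a few lines. A minimal sketch with a linear surrogate fitted by least squares (the linear oracle and all names are illustrative; a real surrogate would be a neural network trained on the collected pairs):

```python
import numpy as np

def extract_surrogate(score_api, dim, n_queries=500, rng=None):
    """Model extraction sketch: query the scoring API with diverse
    random inputs, collect (input, score) pairs, and fit a surrogate
    that mimics the API (here by linear least squares). White-box
    adversarial examples are then crafted on the surrogate."""
    rng = rng or np.random.default_rng(0)
    X = rng.uniform(0.0, 1.0, size=(n_queries, dim))     # diverse probes
    y = np.array([score_api(x) for x in X])              # leaked scores
    w_surrogate, *_ = np.linalg.lstsq(X, y, rcond=None)  # fit surrogate
    return w_surrogate
```

On a scorer the surrogate family can represent, a few hundred score queries recover attack directions that align closely with the real model's gradient, which is what makes the final transfer step work.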
### Defenses
| Defense | Mechanism | Effectiveness | Cost |
|---|---|---|---|
| Adversarial training | Include adversarial examples in training data | 🟢 High — most effective single defense | Training time +50-100% |
| Input preprocessing | JPEG compression, spatial smoothing, bit-depth reduction before model | 🟡 Moderate — destroys some perturbations | Minimal |
| Ensemble models | Multiple diverse models; adversarial for one unlikely to fool all | 🟢 High — diversity is key | Inference cost ×N |
| Randomized smoothing | Add random noise to input; average predictions over multiple samples | 🟡 Moderate — certified robustness | Inference time ×K |
| Rate limiting | Limit API queries per device/IP/session | 🟢 High against query-based attacks | Minimal |
| Score obfuscation | Return binary decision instead of continuous score; add noise to scores | 🟢 High against score-based attacks | Minimal |
| Model diversity | Rotate between different model architectures periodically | 🟡 Moderate — prevents persistent extraction | Maintenance overhead |
| Input anomaly detection | Detect statistically anomalous input patterns that suggest adversarial manipulation | 🟡 Moderate | Additional model required |
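As one concrete instance of the input-preprocessing row, bit-depth reduction snaps most small perturbations back to the clean pixel value (a minimal sketch; the 3-bit default is illustrative):

```python
import numpy as np

def bit_depth_reduce(x, bits=3):
    """Input preprocessing defense: re-quantize pixels in [0, 1] to
    2**bits levels. Perturbations smaller than the quantization step
    are mostly snapped back to the clean pixel's level."""
    levels = 2 ** bits - 1
    return np.round(x * levels) / levels
```

At 3 bits the quantization step is 1/7 ≈ 0.14, far larger than an ε = 8/255 ≈ 0.03 perturbation, so most perturbed pixels quantize back to their clean value. An adaptive attacker can still optimize through a fixed, known preprocessor, which is why the table rates this moderate rather than high.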
### Banking Recommendation
**Defense-in-Depth Against Adversarial Attacks**
- Adversarial training as standard practice in model training pipeline
- Never return raw scores to the client — binary decisions only via API
- Rate limit aggressively — max 3-5 attempts per session, 10 per device per day
- Ensemble of 2-3 diverse models for production inference
- Monitor score distributions — sudden shifts indicate possible adversarial probing
- Rotate models quarterly with architectural diversity
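The query-hardening rules above (aggressive rate limiting plus binary-only responses) can be sketched as follows. The class name, caps, and threshold are illustrative, and the daily-counter reset and persistence are omitted for brevity:

```python
class LivenessGate:
    """Sketch of query hardening: per-session and per-device attempt
    caps, and binary decisions only (the raw score never leaves the
    server, starving score-based gradient estimation)."""

    def __init__(self, session_max=5, device_daily_max=10, threshold=0.5):
        self.session_max = session_max
        self.device_daily_max = device_daily_max
        self.threshold = threshold
        self._session_counts = {}
        self._device_counts = {}   # reset daily in a real deployment

    def check(self, session_id, device_id, raw_score):
        s = self._session_counts.get(session_id, 0)
        d = self._device_counts.get(device_id, 0)
        if s >= self.session_max or d >= self.device_daily_max:
            return "blocked"                    # rate limit exceeded
        self._session_counts[session_id] = s + 1
        self._device_counts[device_id] = d + 1
        # Score obfuscation: expose only the binary decision.
        return "live" if raw_score >= self.threshold else "reject"
```

Returning only `live` / `reject` / `blocked` forces an attacker into the decision-based regime, which the black-box table shows needs 10-100× more queries, and the caps make those queries unavailable.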