# 16. Monitoring and Operations

## Who should read this page

This page is mainly for platform teams, ML operations teams, fraud operations teams, backend engineers, and release owners.

## Why this page exists
A liveness system is not finished when it goes live.
After launch, the real job becomes:
- monitoring behavior
- spotting drift
- detecting incidents
- protecting the user journey
- controlling regressions after updates
## What should be monitored

A good monitoring plan covers four layers.
| Layer | What to monitor |
|---|---|
| user journey | pass rate, retry rate, completion rate |
| security | spoof acceptance trends, attack spikes, security-signal events |
| model behavior | score distributions, calibration drift, disagreement between models |
| infrastructure | latency, timeouts, API errors, device/platform failures |
## A simple monitoring loop

```mermaid
flowchart TB
    A[Live traffic] --> B[Metrics and logs]
    B --> C[Dashboards and alerts]
    C --> D[Incident triage]
    D --> E[Fix, rollback,<br/>or retrain]
    E --> F[Release validation]
```
## Core business and UX metrics
| Metric | Why it matters |
|---|---|
| pass rate | overall flow success |
| retry rate | hidden friction and ambiguity |
| completion rate | customer conversion impact |
| manual review rate | operational burden |
| abandonment rate | whether users leave mid-flow |
These are often more visible to product teams than model-specific metrics.
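The metrics in the table above can be derived from per-session outcome records. A minimal sketch, assuming a hypothetical session shape with a final `outcome` field and a `retries` count (these field names are illustrative, not a standard schema):

```python
from collections import Counter

def journey_metrics(sessions):
    """Compute the UX metrics above from a list of session records.

    Each session is assumed to carry a final 'outcome'
    ('pass', 'reject', 'abandoned', 'manual_review') and a
    'retries' count. Field names are hypothetical.
    """
    n = len(sessions)
    outcomes = Counter(s["outcome"] for s in sessions)
    return {
        "pass_rate": outcomes["pass"] / n,
        "completion_rate": (n - outcomes["abandoned"]) / n,
        "retry_rate": sum(1 for s in sessions if s["retries"] > 0) / n,
        "manual_review_rate": outcomes["manual_review"] / n,
        "abandonment_rate": outcomes["abandoned"] / n,
    }

sessions = [
    {"outcome": "pass", "retries": 0},
    {"outcome": "pass", "retries": 2},
    {"outcome": "abandoned", "retries": 1},
    {"outcome": "manual_review", "retries": 0},
]
print(journey_metrics(sessions)["pass_rate"])  # 0.5
```

Computing all five rates from the same session records keeps them mutually consistent, which matters when product and operations teams compare dashboards.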
## Core security metrics
| Metric | Why it matters |
|---|---|
| spoof acceptance trend | direct fraud exposure signal |
| attack-type concentration | shows what attackers are trying |
| injection / virtual-camera detections | indicates advanced attack activity |
| high-risk session rate | shows pressure on step-up flow |
## Core model and policy metrics
| Metric | Why it matters |
|---|---|
| live score distribution | shows drift in genuine traffic |
| spoof score distribution | shows whether attacks are getting harder |
| calibration drift | thresholds may be aging |
| per-segment pass/retry/reject | catches hidden regressions |
| model disagreement rate | useful in fusion systems |
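One way to make the disagreement-rate row concrete: count sessions where two models land on opposite sides of the decision threshold. A sketch under the assumption of a single shared threshold (the 0.5 default is illustrative):

```python
def disagreement_rate(passive_scores, active_scores, threshold=0.5):
    """Fraction of sessions where the passive and active models land
    on opposite sides of the decision threshold. A rising value after
    a release can flag a calibration or fusion regression.
    The 0.5 threshold is an illustrative default, not a standard.
    """
    pairs = list(zip(passive_scores, active_scores))
    disagree = sum((p >= threshold) != (a >= threshold) for p, a in pairs)
    return disagree / len(pairs)

print(disagreement_rate([0.9, 0.2, 0.6, 0.4], [0.8, 0.7, 0.55, 0.3]))  # 0.25
```

Tracking this per model version makes it easy to compare a new fusion release against the previous baseline.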
## Core infrastructure metrics
| Metric | Why it matters |
|---|---|
| p50 / p95 / p99 latency | affects user experience |
| request failure rate | API stability |
| timeout rate | can look like model failure |
| SDK crash or camera error rate | client reliability |
| platform-specific failure rate | device and browser health |
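The p50/p95/p99 row can be computed directly from raw request latencies with the standard library; a minimal sketch:

```python
import statistics

def latency_percentiles(latencies_ms):
    """p50 / p95 / p99 from raw request latencies in milliseconds.

    statistics.quantiles with n=100 returns 99 cut points between
    percentile buckets; index k-1 approximates the k-th percentile.
    """
    cuts = statistics.quantiles(latencies_ms, n=100)
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}

lat = [100 + i for i in range(200)]  # synthetic latencies, 100..299 ms
print(latency_percentiles(lat))
```

In production these are usually computed by the metrics backend over a sliding window rather than in application code, but the definition is the same.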
## Dashboard slices that matter
Dashboards are more useful when they can be segmented by:
- flow type
- platform
- device class
- browser family
- SDK version
- app version
- geography if relevant
- model version
- policy version
Without slicing, major problems stay hidden inside averages.
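Slicing is a grouped aggregation. A minimal sketch, grouping hypothetical events by platform and SDK version so a regression in one combination is not averaged away (the event fields are illustrative):

```python
from collections import defaultdict

def pass_rate_by_slice(events, keys=("platform", "sdk_version")):
    """Group events by the given dimensions and compute the pass
    rate per slice. Event field names are illustrative."""
    totals = defaultdict(lambda: [0, 0])  # slice -> [passes, total]
    for e in events:
        k = tuple(e[key] for key in keys)
        totals[k][0] += e["decision"] == "pass"
        totals[k][1] += 1
    return {k: passes / n for k, (passes, n) in totals.items()}

events = [
    {"platform": "web", "sdk_version": "2.4.1", "decision": "pass"},
    {"platform": "web", "sdk_version": "2.4.1", "decision": "retry"},
    {"platform": "ios", "sdk_version": "3.1.0", "decision": "pass"},
]
print(pass_rate_by_slice(events))
# {('web', '2.4.1'): 0.5, ('ios', '3.1.0'): 1.0}
```

The same pattern extends to any of the dimensions listed above by changing `keys`.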
## Alerting examples
| Alert example | Why it matters |
|---|---|
| retry rate jumps 30% on web | likely UX, threshold, or browser regression |
| spoof acceptance spikes in one channel | possible active attack campaign |
| p95 latency doubles after release | infrastructure or model-load issue |
| model disagreement jumps sharply | possible calibration or fusion regression |
| one SDK version has high capture failure | client release quality issue |
## Drift to watch for
Not all drift is fraud. Some drift is normal environmental change.
Useful drift categories:
- traffic mix drift
- device mix drift
- score distribution drift
- quality drift
- attack-pattern drift
- seasonal or campaign-based drift
The goal is to separate normal movement from dangerous change.
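One widely used way to quantify score distribution drift is the Population Stability Index (PSI) between a reference sample and the current window. A minimal sketch, assuming scores in [0, 1]; the bin count and the usual <0.1 / 0.1–0.25 / >0.25 rule of thumb are conventions, not guarantees:

```python
import math

def psi(expected, actual, bins=10, lo=0.0, hi=1.0):
    """Population Stability Index between a reference score sample
    and a current sample. Rule of thumb: < 0.1 stable, 0.1-0.25
    moderate drift, > 0.25 significant drift."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            i = min(int((x - lo) / (hi - lo) * bins), bins - 1)
            counts[i] += 1
        # floor each bucket to avoid log(0) on empty bins
        return [max(c / len(xs), 1e-6) for c in counts]

    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

ref = [0.1, 0.2, 0.8, 0.9, 0.85, 0.15]
cur = [0.1, 0.2, 0.8, 0.9, 0.85, 0.15]
print(round(psi(ref, cur), 6))  # 0.0 -- identical samples, no drift
```

Because PSI is symmetric in direction of movement, it flags both "scores getting higher" and "scores getting lower", which helps separate normal movement from dangerous change.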
## Incident handling playbook

When a serious issue appears, teams should know the response path.

### Example incident flow
- confirm the signal is real
- identify affected channels and versions
- classify as fraud, model, policy, SDK, or infrastructure issue
- reduce impact with rollback or policy change if needed
- run focused error analysis
- document follow-up actions and owners
## Release gating and operations
Monitoring works best when tied to release policy.
Before a major release, define:
- key launch metrics
- rollback thresholds
- who approves release
- how long the guarded rollout lasts
- what traffic slices will be watched first
More on this is covered in 19. Model Governance.
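The rollback thresholds from the checklist above can be captured as a small versioned config plus an automated gate check. A sketch; the threshold names and limits here are assumptions for illustration, not a standard schema:

```python
# Illustrative rollout gate: names and limits are assumptions,
# not a standard schema.
ROLLBACK_THRESHOLDS = {
    "pass_rate_min": 0.90,
    "retry_rate_max": 0.15,
    "p95_latency_ms_max": 2500,
}

def gate_release(window_metrics, thresholds=ROLLBACK_THRESHOLDS):
    """Return the list of violated limits for the guarded-rollout
    window; an empty list means the release may proceed."""
    violations = []
    if window_metrics["pass_rate"] < thresholds["pass_rate_min"]:
        violations.append("pass_rate below minimum")
    if window_metrics["retry_rate"] > thresholds["retry_rate_max"]:
        violations.append("retry_rate above maximum")
    if window_metrics["p95_latency_ms"] > thresholds["p95_latency_ms_max"]:
        violations.append("p95_latency above maximum")
    return violations

print(gate_release({"pass_rate": 0.88, "retry_rate": 0.12, "p95_latency_ms": 1900}))
# ['pass_rate below minimum']
```

Keeping the thresholds in a versioned artifact means the rollback criteria can be reviewed and approved before launch rather than debated mid-incident.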
## What to log for safe operations
A useful operational record often includes:
- request ID
- channel and platform
- SDK / app / model version
- final decision
- intermediate scores or score bands
- quality signals
- latency
- retry count
- security signal summary
Keep privacy policy and retention rules in mind when designing logs.
## Example monitoring event
A monitoring-friendly event should preserve enough context to explain what happened without exposing more data than necessary.
```json
{
  "request_id": "req_34b9",
  "timestamp": "2026-03-18T11:42:09Z",
  "flow_type": "transaction_approval",
  "platform": "web",
  "browser_family": "chrome",
  "sdk_version": "web-2.4.1",
  "model_version": "fusion-0.6.4",
  "policy_version": "policy-2026-03-10",
  "decision": "retry",
  "decision_band": "uncertain",
  "latency_ms": 1288,
  "quality": {
    "blur_score": 0.31,
    "brightness_score": 0.42
  },
  "scores": {
    "passive": 0.58,
    "active": 0.62,
    "fusion": 0.61
  },
  "security": {
    "virtual_camera_signal": false,
    "injection_signal": false
  }
}
```
This kind of event supports dashboards, incident review, and threshold tuning.
## Example alert thresholds
| Signal | Example trigger | Typical first action |
|---|---|---|
| retry rate | +25% vs trailing 7-day baseline | check quality, browser, and threshold slices |
| p95 latency | above 2x normal for 15 minutes | inspect infrastructure, model load, and timeouts |
| disagreement rate | above normal after release | compare calibration and fusion versions |
| spoof acceptance proxy | spike in one flow or geography | tighten policy, review sessions, escalate fraud review |
| camera failures | one client version spikes | pause rollout or hotfix SDK |
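The retry-rate row above ("+25% vs trailing 7-day baseline") can be implemented as a simple relative-change check. A minimal sketch; the baseline window and threshold are the illustrative values from the table:

```python
def retry_rate_alert(current, trailing_7d, rel_increase=0.25):
    """Fire when the current retry rate exceeds the trailing
    7-day baseline by more than rel_increase (the '+25%' trigger
    from the table above)."""
    baseline = sum(trailing_7d) / len(trailing_7d)
    return current > baseline * (1 + rel_increase)

history = [0.08, 0.09, 0.08, 0.10, 0.09, 0.08, 0.09]  # daily retry rates
print(retry_rate_alert(0.12, history))  # True
print(retry_rate_alert(0.10, history))  # False
```

Relative triggers like this adapt to each slice's own baseline, which is why they pair well with the segmented dashboards described earlier.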
## Common operational mistakes
| Mistake | Why it hurts |
|---|---|
| only monitoring overall pass rate | hides device and channel problems |
| not versioning thresholds and policy | makes regression analysis harder |
| no alert for disagreement or drift | fusion failures can stay invisible |
| no rollback plan | incidents take longer to control |
| keeping no post-launch review cadence | issues accumulate quietly |
## Final takeaway
A strong liveness deployment needs more than a strong model.
It needs:
- clear metrics
- segmented dashboards
- alerting
- incident response
- release discipline
- a feedback loop into data and model improvement
That is what turns a model into a reliable service.
## Read next
Go to 17. Security Hardening.