14. Score Calibration and Thresholding¶
Who should read this page¶
This page is mainly for ML engineers, platform engineers, risk teams, and anyone deciding how model scores should turn into pass, retry, or fail decisions.
Why this page exists¶
A model score is not a business decision.
A score must be:
- interpreted correctly
- calibrated when needed
- converted into thresholds or bands
- tested against the real use case
This matters even more when multiple models are combined.
Raw score vs calibrated score¶
Raw score¶
A raw score is the direct output of a model.
Calibrated score¶
A calibrated score is adjusted so it better matches observed probability or decision behavior.
Two models may both output 0.80, but those numbers may not mean the same thing.
Why calibration matters in liveness¶
Calibration helps when:
- different models use different score scales
- model retraining shifts score distributions
- one channel behaves differently from another
- fusion needs comparable signals
- risk teams want consistent decision bands
A practical calibration flow¶
```mermaid
flowchart TB
A[Collect holdout<br/>scores] --> B[Check live vs spoof<br/>distributions]
B --> C[Fit calibration<br/>method]
C --> D[Validate by segment]
D --> E[Set thresholds<br/>or bands]
E --> F[Monitor drift<br/>after release]
```
Common calibration methods¶
| Method | Good for | Main caution |
|---|---|---|
| min-max normalization | simple score scaling | does not fix probability meaning |
| z-score normalization | stable internal comparisons | weak when distributions drift |
| Platt scaling | smooth probability mapping | assumes sigmoid-like behavior |
| isotonic regression | flexible calibration | can overfit on small data |
| quantile banding | practical risk bands | less precise as a probability estimate |
For many production systems, a simple and well-tested calibration approach is more useful than a mathematically fancy one.
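To make the Platt scaling row above concrete, here is a minimal, self-contained sketch that fits `p = sigmoid(a * score + b)` by plain gradient descent on the log loss. The function names, learning rate, and epoch count are illustrative; in practice many teams use a library logistic regression rather than hand-rolled fitting.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def fit_platt(scores, labels, lr=0.5, epochs=5000):
    """Fit p = sigmoid(a * score + b) to binary labels (1 = live, 0 = spoof)
    by gradient descent on the log loss. Hyperparameters are illustrative."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # prediction error for this sample
            grad_a += err * s / n
            grad_b += err / n
        a -= lr * grad_a
        b -= lr * grad_b
    return a, b

def calibrate(score, a, b):
    """Map a raw model score onto the calibrated probability scale."""
    return sigmoid(a * score + b)
```

Fitting a separate `(a, b)` pair per model on holdout data is one way to make two models that both output 0.80 comparable on a shared probability scale.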
Thresholds should match the use case¶
Different use cases often need different thresholds.
| Use case | Typical goal | Usual bias |
|---|---|---|
| account opening | strong fraud control with reasonable completion | slightly stricter |
| login step-up | strong usability with targeted security | balanced |
| high-value transaction | low attack acceptance | stricter |
| account recovery | strong fraud control and auditability | strict plus fallback |
Do not assume one threshold is correct everywhere.
Score bands are often better than one hard threshold¶
A simple banded policy is easier to operate than a single hard threshold.
| Band | Meaning | Typical action |
|---|---|---|
| high confidence live | strong evidence of genuine user | accept |
| uncertain | mixed or weak evidence | retry or step-up |
| high risk spoof | strong spoof evidence or severe security flag | reject or review |
This gives the system room to treat ambiguity differently from clear fraud.
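One way to derive band edges is quantile banding on holdout scores of known-genuine sessions, as in this sketch. The percentile choices (5% / 25%) are placeholders: they directly set what fraction of genuine users land in the retry and reject bands, so they must be chosen from local data.

```python
def quantile_band_edges(live_holdout_scores, reject_q=0.05, retry_q=0.25):
    """Derive band edges from holdout scores of known-genuine sessions.
    The 5% / 25% percentile choices here are illustrative only."""
    s = sorted(live_holdout_scores)
    def pct(q):
        idx = min(int(q * len(s)), len(s) - 1)
        return s[idx]
    return pct(reject_q), pct(retry_q)

def band(score, reject_below, retry_below):
    """Map a score to one of the three bands in the table above."""
    if score < reject_below:
        return "high risk"
    if score < retry_below:
        return "uncertain"
    return "high confidence live"
```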
Example threshold policy¶
```
If calibrated_score >= 0.85 and quality is acceptable:
    accept
If 0.55 <= calibrated_score < 0.85:
    retry once or trigger stronger challenge
If calibrated_score < 0.55:
    reject or route to manual review depending on flow risk
```
The exact numbers should come from local evaluation, not from generic advice.
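The example policy above can be sketched as a small function. The 0.85 / 0.55 numbers and the routing choices are placeholders, and the handling of a high score with poor quality (retry rather than accept) is one possible choice, not a recommendation.

```python
def decide(calibrated_score, quality_ok, high_risk_flow=False):
    """Illustrative decision policy; every number and route here
    should be replaced by values from local evaluation."""
    if calibrated_score >= 0.85:
        # quality gate is kept separate from the spoof decision
        return "accept" if quality_ok else "retry"
    if calibrated_score >= 0.55:
        return "retry"  # or trigger a stronger challenge
    return "manual_review" if high_risk_flow else "reject"
```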
Segment-aware calibration¶
A good calibration review should check whether scores behave differently across:
- platform
- device class
- app vs web
- low light vs bright light
- attack type
- model version
If one segment behaves very differently, the answer may be:
- segment-specific calibration
- segment-specific threshold
- or a product decision to reduce exposure in that segment
Calibration for fusion systems¶
When multiple models feed a fusion layer:
- calibrate each model individually when needed
- validate score stability by segment
- combine the calibrated signals
- then set the final fusion thresholds
This is usually safer than combining raw scores directly.
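The fusion steps above can be sketched as calibrate-then-combine. A weighted mean of calibrated scores is one simple combiner; averaging in log-odds space is a common alternative. The weights here are illustrative, not a recommendation.

```python
def fuse(raw_scores, calibrators, weights):
    """Calibrate each model's raw output, then combine the calibrated
    signals with a fixed weighted mean. Weights are illustrative."""
    calibrated = [cal(s) for s, cal in zip(raw_scores, calibrators)]
    return sum(w * c for w, c in zip(weights, calibrated)) / sum(weights)
```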
Threshold migration across releases¶
Thresholds should be versioned with the model or policy.
A useful release question is:
If the model changed, do the old thresholds still behave the same way?
Many teams miss this and accidentally shift user experience or fraud exposure after a model update.
Retry threshold vs reject threshold¶
A common mistake is treating every low-confidence result as fraud.
Poor quality and spoof are not the same thing.
A stronger policy separates them:
- retry threshold for uncertain cases
- reject threshold for high-risk spoof cases
- separate quality-gate threshold where needed
What to monitor after release¶
Calibration work is not finished at launch.
Monitor:
- score distribution shift
- pass / retry / reject rates
- segment-specific drift
- attack-type failures
- release-to-release threshold behavior
This is closely tied to 16. Monitoring and Operations.
Common mistakes¶
| Mistake | Why it is risky |
|---|---|
| using one threshold for all channels | hides channel-specific behavior |
| averaging uncalibrated scores | creates unstable fusion |
| copying thresholds from a vendor demo | rarely fits local risk |
| recalibrating on test data | contaminates final evaluation |
| treating quality failures as spoof | hurts user experience and analysis |
Final takeaway¶
Good thresholding is not just about picking a number.
It is about:
- understanding score meaning
- calibrating where needed
- mapping score to policy
- validating by segment
- monitoring drift after release
That is what turns model output into a trustworthy decision process.
Need term help?¶
If any technical terms on this page feel dense, use Appendix A1 — Key Terms first and then jump to the relevant appendix page for deeper detail.
Related docs¶
Read next¶
Go to 15. Error Analysis.