08. Evaluation Playbook¶
Who should read this page¶
This page is mainly for ML engineers, evaluators, QA teams, fraud teams, and decision-makers who need to know whether a liveness system is actually ready for production.
Why this page exists¶
Many face liveness discussions stop at model training or a few benchmark numbers. That is not enough for a real deployment.
Evaluation should answer a harder question:
How well does the system behave on the attacks, devices, conditions, and users that matter in our environment?
What good evaluation should cover¶
A strong evaluation should cover at least these dimensions:
- genuine live traffic conditions
- attack diversity
- device diversity
- capture quality variation
- environment variation
- fairness and accessibility impact
- operational metrics such as latency and retry rate
A simple evaluation flow¶
```mermaid
flowchart TB
    A[Define use case<br/>and threat model] --> B[Build evaluation<br/>dataset]
    B --> C[Choose metrics]
    C --> D[Run tests by<br/>segment]
    D --> E[Calibrate thresholds]
    E --> F[Review failures]
    F --> G[Approve, improve,<br/>or block release]
```
Step 1: Define the evaluation question¶
Before measuring anything, define the use case clearly.
Examples:
- account opening in native Android and iOS apps
- browser onboarding on lower-end webcams
- high-risk transaction step-up with short selfie video
- account recovery with passive liveness only
The same model can behave very differently across these settings.
Step 2: Build the right dataset¶
Include both bona fide and attack data¶
At minimum, your dataset should contain:
- genuine live captures
- print attacks
- replay attacks
- injection-style attacks if relevant
- mask or 3D attack types where risk exists
- AI-generated or manipulated content where relevant
Include realistic variation¶
Do not make the dataset too clean.
Include variation in:
- lighting
- camera quality
- blur
- pose
- occlusion
- background
- network or compression effects when relevant
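A minimal sketch of a manifest schema that keeps these condition tags attached to every capture, so segments can be sliced later. The field names and values here are illustrative, not taken from any specific tool:

```python
from dataclasses import dataclass

@dataclass
class Sample:
    path: str      # capture file
    label: str     # "bona_fide" or an attack type, e.g. "print", "replay"
    device: str    # e.g. "low_end_android", "iphone", "laptop_webcam"
    lighting: str  # e.g. "indoor_bright", "low_light", "outdoor_mixed"

# hypothetical entries showing the mix of bona fide and attack captures
manifest = [
    Sample("s1.mp4", "bona_fide", "iphone", "indoor_bright"),
    Sample("s2.mp4", "print", "low_end_android", "low_light"),
    Sample("s3.mp4", "replay", "laptop_webcam", "outdoor_mixed"),
]

attacks = [s for s in manifest if s.label != "bona_fide"]
```

Tagging at ingestion time is much cheaper than re-labeling conditions after the evaluation set has grown.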
Step 3: Segment the data¶
Overall averages are not enough.
Useful segmentation examples:
| Segment type | Example |
|---|---|
| device | low-end Android, flagship Android, iPhone, laptop webcam |
| channel | mobile app, mobile web, desktop web |
| environment | indoor bright, low light, outdoor mixed light |
| attack type | print, replay, injection, deepfake |
| user journey | onboarding, login step-up, recovery |
A model can look good overall while failing badly in one segment.
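One way to surface this in code is to compute the decision error rate per segment instead of a single global number. The record format below is an illustrative assumption:

```python
from collections import defaultdict

def error_rate_by_segment(records):
    """records: (segment, is_attack, accepted) triples.
    A decision is wrong when an attack is accepted or a genuine user is rejected."""
    totals = defaultdict(int)
    wrong = defaultdict(int)
    for segment, is_attack, accepted in records:
        totals[segment] += 1
        if accepted == is_attack:  # accepted attack, or rejected genuine user
            wrong[segment] += 1
    return {seg: wrong[seg] / totals[seg] for seg in totals}

records = [
    ("low_end_android", False, True),  # genuine accepted: correct
    ("low_end_android", True, True),   # attack accepted: wrong
    ("iphone", False, True),           # genuine accepted: correct
    ("iphone", True, False),           # attack rejected: correct
]
rates = error_rate_by_segment(records)
# rates == {"low_end_android": 0.5, "iphone": 0.0}
```

An overall error rate of 25% here would hide that every mistake lands on one device class.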
Step 4: Use metrics that matter¶
Core PAD metrics¶
- APCER (Attack Presentation Classification Error Rate): the proportion of attack presentations incorrectly accepted as live
- BPCER (Bona fide Presentation Classification Error Rate): the proportion of genuine users incorrectly rejected
- ACER (Average Classification Error Rate): the average of APCER and BPCER
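As a sketch of how these rates fall out of raw model scores, assuming higher score means more likely live and acceptance at or above a threshold:

```python
def pad_metrics(live_scores, attack_scores, threshold):
    """Compute (APCER, BPCER, ACER) at a fixed accept-threshold.
    Assumes higher score = more likely live; accept when score >= threshold."""
    apcer = sum(s >= threshold for s in attack_scores) / len(attack_scores)
    bpcer = sum(s < threshold for s in live_scores) / len(live_scores)
    return apcer, bpcer, (apcer + bpcer) / 2

# one attack slips through and one genuine user is rejected at threshold 0.5
a, b, c = pad_metrics([0.9, 0.8, 0.4, 0.95], [0.1, 0.6, 0.2, 0.3], 0.5)
# a == 0.25, b == 0.25, c == 0.25
```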
Operational metrics¶
- latency
- retry rate
- completion rate
- manual review rate
- device-specific failure rate
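Retry and completion rates fall directly out of per-session attempt logs. The event schema below is an illustrative assumption:

```python
# hypothetical per-session records: attempts made and whether the user finished
sessions = [
    {"attempts": 1, "completed": True},
    {"attempts": 3, "completed": True},
    {"attempts": 2, "completed": False},
]

# share of sessions needing more than one capture attempt
retry_rate = sum(s["attempts"] > 1 for s in sessions) / len(sessions)
# share of sessions that finished the liveness step
completion_rate = sum(s["completed"] for s in sessions) / len(sessions)
```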
Why both matter¶
A model can look strong on ACER but still create a bad user journey if retry or completion rates are poor.
Step 5: Calibrate thresholds¶
Threshold choice is not just a model question. It is a business and risk decision.
Good threshold calibration process¶
- evaluate score distributions on live and spoof data
- compare results by segment
- choose a threshold or score bands for the target use case
- test the policy with retry logic included
- review business impact before release
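The first steps of this process can be sketched as a simple threshold sweep: pick the most permissive threshold that still holds APCER at or below a target, which also minimizes genuine-user rejection. The 5% default below is an illustrative assumption, not a recommendation, and higher score is again assumed to mean more likely live:

```python
def calibrate_threshold(live_scores, attack_scores, max_apcer=0.05):
    """Return the lowest accept-threshold whose APCER meets the target.
    Lower thresholds reject fewer genuine users (lower BPCER), so the first
    compliant threshold in ascending order is the best trade-off."""
    for t in sorted(set(live_scores + attack_scores)):
        apcer = sum(s >= t for s in attack_scores) / len(attack_scores)
        if apcer <= max_apcer:
            return t
    return None  # no candidate threshold meets the target

t = calibrate_threshold([0.7, 0.8, 0.9], [0.1, 0.2, 0.6], max_apcer=0.34)
# t == 0.6: one of three attacks scores at or above it (APCER ~0.33)
```

In practice this sweep should be repeated per segment, since one global threshold can be right for flagship phones and wrong for low-end webcams.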
Step 6: Review failures, not just metrics¶
Look at examples where the system failed and group them by reason.
Common failure buckets¶
- low light
- blur
- occlusion
- reflective screen replay
- weak browser capture quality
- unusual camera angle
- model confusion on AI-generated content
Failure review often teaches more than one more decimal point in a benchmark table.
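Bucketing can be as simple as counting tagged reasons so the biggest problem areas surface first. The tags below are illustrative:

```python
from collections import Counter

# hypothetical reviewed failures, each tagged with a reason during triage
failures = ["low_light", "blur", "low_light", "replay_reflection", "low_light"]
by_reason = Counter(failures).most_common()
# [("low_light", 3), ("blur", 1), ("replay_reflection", 1)]
```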
Example test matrix¶
| Dimension | Example values |
|---|---|
| app channel | Android app, iOS app, web |
| device class | low-end, mid-range, high-end |
| capture type | still image, short video |
| attack type | print, replay, injection, deepfake |
| environment | bright indoor, dim indoor, outdoor |
This kind of matrix helps teams see coverage gaps early.
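One way to spot those gaps is to enumerate the full cross-product of the matrix dimensions and diff it against what has actually been tested. The dimension values mirror the table above; the `tested` set is a placeholder:

```python
import itertools

dimensions = {
    "channel": ["android_app", "ios_app", "web"],
    "device_class": ["low_end", "mid_range", "high_end"],
    "attack_type": ["print", "replay", "injection", "deepfake"],
}

# every combination the matrix implies: 3 * 3 * 4 = 36 cells
cells = set(itertools.product(*dimensions.values()))
tested = {("android_app", "low_end", "print")}  # placeholder for real coverage
gaps = cells - tested
```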
Release readiness checklist¶
- use case defined clearly
- threat model documented
- attack types mapped to evaluation data
- live data covers real capture conditions
- results reviewed by key segment
- thresholds calibrated on local data
- retry policy evaluated with model outputs
- failure analysis completed
- rollback criteria defined
- monitoring plan ready for launch
What not to do¶
| Weak practice | Why it is risky |
|---|---|
| testing only on clean lab data | hides real-world failure modes |
| using only one overall metric | hides segment weakness |
| copying another team's threshold | may be wrong for your use case |
| ignoring browser and low-end devices | causes production surprises |
| skipping post-launch re-evaluation | attackers and data drift change over time |
After launch: evaluation never fully stops¶
Real systems need ongoing checks.
Monitor for:
- score distribution drift
- sudden changes in retry rate
- model regressions after updates
- new attack patterns
- segment-specific degradation
Evaluation should continue after deployment, not end there.
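Score distribution drift can be watched with a simple statistic such as the Population Stability Index. This is a hedged sketch assuming scores in [0, 1]; the commonly quoted alert level of 0.2 is a rule of thumb to tune locally, not a standard:

```python
import math

def psi(expected, actual, bins=10):
    """Population Stability Index between a baseline score sample and a
    recent one. Larger values mean the distributions have diverged more."""
    def hist(xs):
        counts = [0] * bins
        for x in xs:
            counts[min(int(x * bins), bins - 1)] += 1
        n = len(xs)
        # floor each share at a tiny value to avoid log(0)
        return [max(c / n, 1e-6) for c in counts]
    e, a = hist(expected), hist(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))

baseline = [i / 100 for i in range(100)]          # launch-time score sample
recent = [min(s + 0.3, 0.99) for s in baseline]   # simulated upward shift
stable = psi(baseline, baseline)  # identical samples: no drift
drift = psi(baseline, recent)     # shifted sample: large PSI
```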
Final takeaway¶
A good liveness evaluation is not just “How accurate is the model?”
It is:
- how accurate on our attacks
- how accurate on our devices
- how accurate in our environments
- how usable for our customers
- how stable after release
That is what makes evaluation useful.
Related docs¶
Read next¶
Go to 09. Common Failures.