13. Dataset Strategy¶
Who should read this page¶
This page is mainly for ML engineers, data teams, QA teams, and technical leads planning training, validation, benchmarking, or fusion work.
Why this page exists¶
Many liveness projects fail because the model is trained on data that does not match the real problem.
A good dataset strategy should answer questions like:
- Which attacks are we trying to stop?
- Which channels and devices do we support?
- How do we avoid train/test leakage?
- What extra data is needed for fusion and calibration?
Start from the problem, not from the data you happen to have¶
A useful dataset strategy begins with the target deployment:
- onboarding, login, recovery, or transaction approval
- app, web, kiosk, or assisted flow
- passive, active, or hybrid liveness
- expected attack types
- expected device mix
A dataset that is perfect for mobile onboarding may still be weak for browser-based step-up authentication.
Think in layers of data¶
A strong program often has more than one dataset.
| Dataset layer | Purpose |
|---|---|
| training set | teach the base model or fusion layer |
| validation set | tune thresholds, hyperparameters, and policies |
| test set | final offline performance check |
| challenge set | stress test rare or hard attacks |
| monitoring sample | compare production drift later |
What the core dataset should include¶
Live data¶
Live data should include realistic variation in:
- age groups
- presentation styles
- lighting
- pose
- expression
- camera quality
- capture distance
- background complexity
- browser and app environments
Spoof data¶
Spoof data should include the attacks relevant to your threat model, such as:
- replay on screen
- partial replay or cropped replay
- mask or 3D props where relevant
- injection or virtual-camera attacks
- AI-generated or manipulated content where relevant
Diversity matters more than raw count alone¶
A large dataset with only one clean device setup is weaker than a smaller dataset with realistic diversity.
Important diversity dimensions:
| Dimension | Examples |
|---|---|
| device | low-end Android, flagship Android, iPhone, laptop webcam |
| channel | app, mobile web, desktop web |
| environment | bright indoor, dim indoor, outdoor mixed light |
| attack execution | different screen brightness, print sizes, attack distances |
| operational flow | onboarding, login step-up, recovery |
Split strategy is critical¶
Leakage can make a weak model look excellent.
Minimum split rules¶
- keep people disjoint across train, validation, and test when possible
- keep near-duplicate captures out of multiple splits
- keep attack sessions disjoint where practical
- keep calibration data separate from final test data
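One simple way to enforce person-disjoint splits is to assign each split with a deterministic hash of the identity field, so every capture from the same person always lands in the same split. A minimal sketch, assuming each record carries a `person_id`:

```python
import hashlib

def assign_split(person_id: str, val_pct: int = 10, test_pct: int = 10) -> str:
    """Deterministically map a person to train/val/test by hashing their ID.

    All samples from one person hash identically, so the split is
    person-disjoint by construction and reproducible across runs.
    """
    bucket = int(hashlib.sha256(person_id.encode()).hexdigest(), 16) % 100
    if bucket < test_pct:
        return "test"
    if bucket < test_pct + val_pct:
        return "val"
    return "train"

# Every capture from the same person gets the same split.
samples = [("cap_001", "p42"), ("cap_002", "p42"), ("cap_003", "p7")]
splits = {cap: assign_split(pid) for cap, pid in samples}
assert splits["cap_001"] == splits["cap_002"]
```

Because the assignment depends only on the ID, newly collected captures from known people automatically fall into the split those people were already in.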
Stronger split patterns¶
| Split style | Why it matters |
|---|---|
| person-disjoint | avoids memorizing users |
| device-aware split | checks generalization across devices |
| attack-scenario split | checks generalization to new spoof execution styles |
| time-based split | useful for drift and release simulation |
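The time-based pattern can be sketched as holding out the most recent captures as test, which simulates deploying a model trained on older data against newer traffic. The record shape below is illustrative:

```python
from datetime import date

def time_based_split(records, test_fraction=0.2):
    """Hold out the newest captures as test, simulating a future release."""
    ordered = sorted(records, key=lambda r: r["captured_on"])
    cut = int(len(ordered) * (1 - test_fraction))
    return ordered[:cut], ordered[cut:]

records = [
    {"sample_id": "a", "captured_on": date(2024, 1, 5)},
    {"sample_id": "b", "captured_on": date(2024, 2, 9)},
    {"sample_id": "c", "captured_on": date(2024, 3, 1)},
    {"sample_id": "d", "captured_on": date(2024, 4, 20)},
    {"sample_id": "e", "captured_on": date(2024, 5, 2)},
]
train, test = time_based_split(records)
# The test slice contains only the newest captures.
```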
Fusion dataset needs a tabular view¶
For a fusion or meta-model, the dataset should not be only image files plus labels.
It should also store the outputs of all upstream systems.
Example fusion training schema¶
| Field | Description |
|---|---|
| `sample_id` | capture or request identifier |
| `person_id` | identity grouping field |
| `session_id` | session grouping field |
| `label` | live or spoof |
| `attack_type` | print, replay, injection, deepfake, etc. |
| `flow_type` | onboarding, login, recovery, transaction |
| `platform` | android, ios, web |
| `device_class` | low, mid, high |
| `lighting_bucket` | bright, dim, backlit |
| `model_a_score` | base-model output |
| `model_b_score` | base-model output |
| `quality_score` | capture quality score |
| `blur_score` | blur measure |
| `brightness_score` | brightness measure |
| `challenge_passed` | active-challenge result |
This is the dataset a fusion layer actually learns from.
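The schema above can be written down as a typed record. The sketch below mirrors the field names in the table; the `features()` encoding is illustrative, not a prescribed feature set:

```python
from dataclasses import dataclass

@dataclass
class FusionSample:
    sample_id: str
    person_id: str
    session_id: str
    label: str             # "live" or "spoof"
    attack_type: str       # e.g. "replay"; "none" for live captures
    flow_type: str
    platform: str
    device_class: str
    lighting_bucket: str
    model_a_score: float
    model_b_score: float
    quality_score: float
    blur_score: float
    brightness_score: float
    challenge_passed: bool

    def features(self) -> list[float]:
        """Numeric view a fusion/meta-model can train on (illustrative encoding)."""
        return [
            self.model_a_score,
            self.model_b_score,
            self.quality_score,
            self.blur_score,
            self.brightness_score,
            1.0 if self.challenge_passed else 0.0,
        ]

row = FusionSample("cap_001", "p42", "s9", "spoof", "replay", "login",
                   "android", "low", "dim", 0.81, 0.64, 0.55, 0.30, 0.47, False)
assert row.features() == [0.81, 0.64, 0.55, 0.30, 0.47, 0.0]
```

Keeping the grouping fields (`person_id`, `session_id`) on the record is what makes the split rules above enforceable at training time.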
Suggested data packaging¶
A practical repo layout often looks like this:
```
captures/
  images_or_video/
labels/
  sample_labels.csv
metadata/
  capture_metadata.csv
model_outputs/
  passive_scores.parquet
  active_scores.parquet
  quality_scores.parquet
splits/
  train.csv
  val.csv
  test.csv
```
This makes it easier to rebuild experiments and audit mistakes.
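With this layout, the fusion table is rebuilt by joining the per-sample files on `sample_id`. A minimal in-memory sketch (a real version would read the CSV and Parquet files shown above):

```python
def build_fusion_table(labels, metadata, model_outputs):
    """Join labels, capture metadata, and model scores on sample_id.

    Samples missing any source are dropped, which surfaces packaging
    gaps early instead of silently training on partial rows.
    """
    table = []
    for sid, label_row in labels.items():
        if sid in metadata and sid in model_outputs:
            table.append({"sample_id": sid, **label_row,
                          **metadata[sid], **model_outputs[sid]})
    return table

labels = {"cap_001": {"label": "live"}, "cap_002": {"label": "spoof"}}
metadata = {"cap_001": {"platform": "android"}, "cap_002": {"platform": "web"}}
model_outputs = {"cap_001": {"model_a_score": 0.93}}  # cap_002 has no score yet

rows = build_fusion_table(labels, metadata, model_outputs)
assert [r["sample_id"] for r in rows] == ["cap_001"]
```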
Labeling strategy¶
Useful labels usually include more than just live and spoof.
Recommended labels¶
- final label: live/spoof
- attack type
- attack family
- capture quality bucket
- failure reason when known
- uncertain / ambiguous flag
- device and channel metadata
More detail is covered in Appendix A9 — Data Collection and Labeling.
Real vs synthetic data¶
Synthetic data can help, but it should not silently replace realistic evaluation.
| Data type | Good use | Main caution |
|---|---|---|
| real live data | realistic capture variation | privacy and collection cost |
| real spoof data | realistic threat measurement | expensive to collect broadly |
| synthetic attack data | stress testing and scale | may not match real attacker behavior |
| synthetic quality degradation | robustness training | easy to overuse unrealistically |
A healthy program usually uses synthetic data to expand coverage, not to avoid collecting real examples.
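Synthetic quality degradation in particular is cheap to script. A minimal brightness-degradation sketch on raw pixel values (a real pipeline would use an image library rather than bare lists):

```python
def darken(pixels, factor=0.4):
    """Scale pixel intensities down to simulate a dim-light capture.

    Output values stay clipped to the valid 0-255 range; applying this
    to a fraction of training data is a cheap robustness augmentation.
    """
    return [max(0, min(255, round(p * factor))) for p in pixels]

bright_row = [250, 128, 64, 10]
dim_row = darken(bright_row)
assert dim_row == [100, 51, 26, 4]
```

The caution in the table applies here: if most of the "dim" training data comes from one synthetic transform, the model may learn that transform rather than real low-light capture.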
A practical minimum plan for a new team¶
If the program is early, a useful first plan is:
- collect clean live data across target channels
- collect the top three relevant attack types well
- capture realistic device and lighting variation
- enforce person-disjoint splits
- store per-sample metadata and model outputs from the start
This is more valuable than collecting a huge but messy dataset.
Common mistakes¶
| Mistake | Why it hurts |
|---|---|
| same person appears in train and test | inflates results |
| only clean indoor data | poor field performance |
| no attack taxonomy | weak analysis later |
| no metadata storage | impossible to segment and debug |
| no uncertainty bucket | noisy labels pollute training |
| no versioned splits | results become hard to reproduce |
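Versioning splits can be as simple as hashing the sorted sample IDs in each split file and recording the digest next to experiment results. A minimal sketch:

```python
import hashlib

def split_version(sample_ids) -> str:
    """Content hash of a split: same membership yields the same version string."""
    digest = hashlib.sha256("\n".join(sorted(sample_ids)).encode()).hexdigest()
    return digest[:12]

v1 = split_version(["cap_003", "cap_001", "cap_002"])
v2 = split_version(["cap_001", "cap_002", "cap_003"])
assert v1 == v2                                      # order-independent
assert v1 != split_version(["cap_001", "cap_002"])   # membership change, new version
```

Any reported metric can then cite the split version it was computed on, which makes silently shifting test sets visible.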
Questions to answer before collecting more data¶
- Which attacks matter most in the next release?
- Which devices or channels are under-covered?
- Are low-quality failures hurting live users more than spoof attacks?
- Is the fusion layer blocked by missing metadata rather than missing images?
- Are we collecting data that helps the real problem, or just easy data?
Final takeaway¶
A strong liveness dataset strategy is not just about more samples.
It is about:
- the right threat coverage
- the right channel and device coverage
- clean split rules
- useful metadata
- reproducible packaging
That is what turns data into a reliable engineering asset.
Need term help?¶
If any technical terms on this page feel dense, use Appendix A1 — Key Terms first and then jump to the relevant appendix page for deeper detail.
Related docs¶
- 08. Evaluation Playbook
- 12. Fusion and Meta-Model
- 15. Error Analysis
- Appendix A9 — Data Collection and Labeling
- Appendix A10 — Experiment Design