Liveness Datasets¶

Definition¶

This article catalogues the major datasets used for training and evaluating face liveness / presentation attack detection models. Dataset diversity and quality directly determine how well a liveness model generalizes to real-world attacks.

Major Datasets¶

Dataset	Year	Subjects	Videos/Images	Attack Types	Capture Devices	Key Feature
OULU-NPU	2017	55	5,940 videos	Print, replay	6 phones	4 protocols (cross-device, cross-attack)
CASIA-FASD	2012	50	600 videos	Print, replay	3 cameras	Early benchmark, still used
Replay-Attack	2012	50	1,300 clips	Print, replay (phone, tablet)	1 camera	Classic benchmark
SiW	2018	165	4,620 videos	Print, replay	1 camera	Large scale, high quality
SiW-M	2019	493	1,630 videos	13 attack types (masks, makeup, partial)	1 camera	Most diverse attack types
CelebA-Spoof	2020	10,177	625K images	Print, replay, 3D mask	Multiple	Largest dataset
WMCA	2019	72	1,679 videos	Print, replay, mask (rigid, flex, paper)	Multi-spectral (RGB, depth, NIR, thermal)	Multi-modal
HiFiMask	2021	75	54K videos	High-fidelity 3D masks (resin, plaster, transparent)	7 cameras	Most realistic mask attacks
CeFA	2020	1,607	30K videos	Print, replay, 3D mask	Multi-spectral	Cross-ethnicity focus
CASIA-SURF	2019	1,000	21K videos	Print, cut-out	Multi-modal (RGB, depth, IR)	Multi-modal

Cross-Dataset Evaluation Protocols¶

The standard way to evaluate liveness model generalization:

Leave-One-Out Protocol¶

Train on 3 datasets, test on the 4th (using OULU-NPU [O], CASIA-FASD [C], Idiap Replay [I], MSU-MFSD [M]):

Protocol	Train	Test	What It Evaluates
O&C&I → M	OULU + CASIA + Idiap	MSU	Generalize to unseen capture setup
O&M&I → C	OULU + MSU + Idiap	CASIA	Generalize to different camera
O&C&M → I	OULU + CASIA + MSU	Idiap	Generalize to different environment
I&C&M → O	Idiap + CASIA + MSU	OULU	Generalize to mobile devices

Why Cross-Dataset Testing Matters¶

Test Type	What It Shows	Typical ACER
Intra-dataset (same dataset train/test)	Model memorization ability	1-3%
Cross-dataset (different dataset)	Real generalization ability	10-30%
Real-world (production attacks)	Actual deployment performance	Unknown (often worse)

Dataset Selection for Training¶

Goal	Recommended Datasets
Baseline training	OULU-NPU + CASIA-FASD
Diverse attacks	Add SiW-M (13 attack types)
3D mask robustness	Add HiFiMask or WMCA
Scale	CelebA-Spoof (625K images)
Cross-ethnicity	CeFA (multi-ethnicity)
Multi-modal	WMCA or CASIA-SURF (RGB + Depth + NIR)

Key Takeaways¶

Summary

OULU-NPU is the gold standard benchmark with 4 well-defined protocols
Cross-dataset evaluation reveals true generalization — intra-dataset results are misleading
CelebA-Spoof (625K images) is the largest, but OULU-NPU protocols remain most cited
HiFiMask provides the most realistic 3D mask attacks
Dataset diversity (devices, attack types, ethnicities) is more important than dataset size
For production models, train on as many diverse datasets as possible

Previous: ← Liveness Model Architectures
Next: Domain Generalization for Liveness →
Face Liveness Detection Overview
Presentation Attack Types