Liveness Datasets

Definition

This article catalogues the major datasets used for training and evaluating face liveness / presentation attack detection models. Dataset diversity and quality directly determine how well a liveness model generalizes to real-world attacks.


Major Datasets

| Dataset | Year | Subjects | Videos/Images | Attack Types | Capture Devices | Key Feature |
|---|---|---|---|---|---|---|
| OULU-NPU | 2017 | 55 | 4,950 videos | Print, replay | 6 phones | 4 protocols (cross-device, cross-attack) |
| CASIA-FASD | 2012 | 50 | 600 videos | Print, replay | 3 cameras | Early benchmark, still used |
| Replay-Attack | 2012 | 50 | 1,300 clips | Print, replay (phone, tablet) | 1 camera | Classic benchmark |
| SiW | 2018 | 165 | 4,478 videos | Print, replay | 2 cameras | Large scale, high quality |
| SiW-M | 2019 | 493 | 1,630 videos | 13 attack types (masks, makeup, partial) | 1 camera | Most diverse attack types |
| CelebA-Spoof | 2020 | 10,177 | 625K images | Print, replay, 3D mask | Multiple | Largest dataset |
| WMCA | 2019 | 72 | 1,679 videos | Print, replay, mask (rigid, flex, paper) | Multi-spectral (RGB, depth, NIR, thermal) | Multi-modal |
| HiFiMask | 2021 | 75 | 54K videos | High-fidelity 3D masks (resin, plaster, transparent) | 7 cameras | Most realistic mask attacks |
| CeFA | 2020 | 1,607 | 23.5K videos | Print, replay, 3D mask | Multi-modal (RGB, depth, IR) | Cross-ethnicity focus |
| CASIA-SURF | 2019 | 1,000 | 21K videos | Print, cut-out | Multi-modal (RGB, depth, IR) | Multi-modal |

Cross-Dataset Evaluation Protocols

The standard way to evaluate how well a liveness model generalizes is to train it on some datasets and test it on a dataset it has never seen.

Leave-One-Out Protocol

Train on three datasets and test on the fourth, using OULU-NPU [O], CASIA-FASD [C], Idiap Replay-Attack [I], and MSU-MFSD [M]:

| Protocol | Train | Test | What It Evaluates |
|---|---|---|---|
| O&C&I → M | OULU + CASIA + Idiap | MSU | Generalize to unseen capture setup |
| O&M&I → C | OULU + MSU + Idiap | CASIA | Generalize to different camera |
| O&C&M → I | OULU + CASIA + MSU | Idiap | Generalize to different environment |
| I&C&M → O | Idiap + CASIA + MSU | OULU | Generalize to mobile devices |
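
The four protocols follow mechanically from the list of dataset codes, so they can be generated rather than hard-coded. The sketch below is illustrative only: the function name `leave_one_out_splits` and the `DATASETS` mapping are assumptions made for this example, not part of any benchmark toolkit, and the order of codes inside a generated protocol name may differ from the table (for example `C&I&M -> O` instead of `I&C&M → O`).

```python
from typing import Dict, List, Tuple

# Dataset codes as used in the protocol names (hypothetical mapping for this sketch).
DATASETS: Dict[str, str] = {
    "O": "OULU-NPU",
    "C": "CASIA-FASD",
    "I": "Idiap Replay-Attack",
    "M": "MSU-MFSD",
}


def leave_one_out_splits(codes: List[str]) -> List[Tuple[str, List[str], str]]:
    """Return (protocol_name, train_codes, test_code) for every held-out dataset."""
    splits = []
    for held_out in codes:
        # Train on every dataset except the one reserved for testing.
        train = [c for c in codes if c != held_out]
        name = "&".join(train) + " -> " + held_out
        splits.append((name, train, held_out))
    return splits


splits = leave_one_out_splits(list(DATASETS))
```

Each tuple can then drive one training run, with the held-out dataset touched only at evaluation time.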

Why Cross-Dataset Testing Matters

| Test Type | What It Shows | Typical ACER |
|---|---|---|
| Intra-dataset (same dataset train/test) | Model memorization ability | 1-3% |
| Cross-dataset (different dataset) | Real generalization ability | 10-30% |
| Real-world (production attacks) | Actual deployment performance | Unknown (often worse) |
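
For reference, ACER (Average Classification Error Rate) is conventionally the mean of APCER (attacks accepted as live) and BPCER (genuine faces rejected), with APCER taken as the worst case over attack types. The following is a minimal sketch of that convention, assuming per-sample accept/reject decisions are already available; the function names and input layout are hypothetical and chosen only for this example.

```python
from typing import Dict, Sequence


def error_rate(decisions: Sequence[bool], expected: bool) -> float:
    """Fraction of decisions that differ from the expected outcome."""
    if not decisions:
        raise ValueError("need at least one sample")
    return sum(d != expected for d in decisions) / len(decisions)


def acer(
    bona_fide_accepted: Sequence[bool],
    attacks_accepted: Dict[str, Sequence[bool]],
) -> Dict[str, float]:
    """Compute APCER, BPCER and ACER from accept (True) / reject (False) decisions."""
    # BPCER: share of genuine presentations that were wrongly rejected.
    bpcer = error_rate(bona_fide_accepted, expected=True)
    # APCER: share of attacks wrongly accepted, worst case over attack types.
    apcer = max(
        error_rate(accepted, expected=False)
        for accepted in attacks_accepted.values()
    )
    return {"APCER": apcer, "BPCER": bpcer, "ACER": (apcer + bpcer) / 2}


scores = acer(
    bona_fide_accepted=[True, True, True, False],
    attacks_accepted={"print": [True, False], "replay": [False, False]},
)
```

Taking the maximum over attack types means a model cannot hide a weakness against one attack behind strong results on another.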

Dataset Selection for Training

| Goal | Recommended Datasets |
|---|---|
| Baseline training | OULU-NPU + CASIA-FASD |
| Diverse attacks | Add SiW-M (13 attack types) |
| 3D mask robustness | Add HiFiMask or WMCA |
| Scale | CelebA-Spoof (625K images) |
| Cross-ethnicity | CeFA (multi-ethnicity) |
| Multi-modal | WMCA or CASIA-SURF (RGB + Depth + NIR) |
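
The recommendations above can be kept as a small lookup so that a training configuration is derived from stated goals instead of being assembled by hand. This is only a sketch: the goal keys and the `select_datasets` helper are invented for illustration, and where the table says "or" the sketch simply lists both candidates and leaves the final choice to the reader.

```python
from typing import Dict, Iterable, List

# Goal -> candidate datasets, transcribed from the table (keys are hypothetical).
RECOMMENDED: Dict[str, List[str]] = {
    "baseline": ["OULU-NPU", "CASIA-FASD"],
    "diverse_attacks": ["SiW-M"],
    "3d_mask": ["HiFiMask", "WMCA"],
    "scale": ["CelebA-Spoof"],
    "cross_ethnicity": ["CeFA"],
    "multi_modal": ["WMCA", "CASIA-SURF"],
}


def select_datasets(goals: Iterable[str]) -> List[str]:
    """Return the baseline datasets plus candidates for each extra goal, without duplicates."""
    selected: List[str] = []
    for goal in ["baseline", *goals]:
        # An unknown goal raises KeyError, which surfaces typos early.
        for name in RECOMMENDED[goal]:
            if name not in selected:
                selected.append(name)
    return selected


training_sets = select_datasets(["3d_mask", "multi_modal"])
```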

Key Takeaways

  • OULU-NPU is the gold standard benchmark with 4 well-defined protocols
  • Cross-dataset evaluation reveals true generalization; intra-dataset results alone overstate it (typical ACER of 1-3% versus 10-30%)
  • CelebA-Spoof (625K images) is the largest dataset listed here, but the OULU-NPU protocols remain among the most widely reported
  • HiFiMask provides the most realistic 3D mask attacks
  • Dataset diversity (devices, attack types, ethnicities) is more important than dataset size
  • For production models, train on as many diverse datasets as possible