Skip to content

Face Liveness Detection Overview

Definition

Face liveness detection (also called Presentation Attack Detection or PAD) determines whether the face presented to the camera is a real, live human being — as opposed to a spoofed representation such as a printed photo, a screen replay, a 3D mask, or a deepfake.

It is arguably the most critical AI component in eKYC. If liveness fails, an attacker with a stolen ID can use the victim's photo to pass face matching — making the entire verification meaningless.


Why Liveness is Critical

graph TD
    A[Attacker has victim's ID document] --> B[ID passes document verification ✅]
    B --> C{Liveness Check?}
    C -->|No liveness| D[Attacker holds up photo of victim]
    D --> E[Face matches ID photo ✅]
    E --> F[❌ Account opened fraudulently]

    C -->|With liveness| G[Attacker's spoof detected]
    G --> H[✅ Attack blocked]

    style F fill:#e53935,color:#fff
    style H fill:#2E7D32,color:#fff

Liveness vs Face Recognition

Aspect Face Recognition Face Liveness
Question Is this the same person as on the ID? Is this a live, real person?
Input Face image → embedding → comparison Face image/video → real/spoof classification
Output Similarity score (0-1) Binary: live or spoof (with confidence)
Threat model Wrong person Right person's photo/video/mask/deepfake
Complementary Useless without liveness Useless without recognition

Attack Taxonomy

graph TD
    A[Spoofing Attacks] --> B[Presentation Attacks<br/>Physical artifacts held to camera]
    A --> C[Injection Attacks<br/>Digital manipulation of camera feed]

    B --> B1[2D Attacks]
    B --> B2[3D Attacks]

    B1 --> B1a[Print attack - photo on paper]
    B1 --> B1b[Screen replay - photo/video on device]
    B1 --> B1c[Cut-out attack - face cut from photo]

    B2 --> B2a[Silicone mask]
    B2 --> B2b[Resin/3D-printed mask]
    B2 --> B2c[Flexible mask - latex, paper-craft]

    C --> C1[Virtual camera - OBS, ManyCam]
    C --> C2[API injection - bypass SDK]
    C --> C3[Emulator - Android emulator]
    C --> C4[Real-time deepfake - face swap]
    C --> C5[App hooking - Frida, Xposed]

    style B fill:#F57F17,color:#000
    style C fill:#e53935,color:#fff

Detection Approaches

Passive Liveness

Analyzes a single image or short video without requiring the user to perform any action:

Feature Type What It Detects Model Approach
Texture analysis Moiré patterns, print dots, screen pixels CNN feature extraction
Color analysis Color distribution differences (real skin vs paper/screen) Color space analysis (HSV, YCbCr)
Frequency analysis High-frequency artifacts from printing/display Fourier/wavelet features
Depth cues Flatness of 2D attacks Monocular depth estimation
Reflection Specular reflections differ between skin and paper/screen Reflection pattern analysis
Edge artifacts Boundaries between face and spoof medium Edge gradient analysis

Active Liveness

Requires the user to perform specific actions (challenges):

Challenge What It Proves User Action
Blink Face can blink naturally Close and open eyes
Head turn Face has 3D structure, responds to instructions Turn head left/right
Smile Face can change expression Smile naturally
Random gaze Eyes track a moving target Follow on-screen dot
Illumination response Skin reflects light differently than paper/screen Screen flashes colors, observe face response

Comparison

Aspect Passive Active
User experience Seamless — just look at camera Requires following instructions
Speed < 1 second 3-10 seconds
Spoof resistance Good (with strong models) Higher (harder to replay)
Deepfake resistance Moderate Higher (dynamic actions)
Accessibility Better (no action needed) Challenging for some disabilities
Failure rate Lower Higher (users confused by challenges)
Industry trend Growing preference Declining (UX concerns)

Model Architecture Overview

Typical Pipeline

graph TD
    A[Face Image 224×224] --> B[Backbone]
    B --> C[Feature Maps]
    C --> D[Classification Head]
    D --> E["Output: P(real), P(spoof)"]

    B -.->|Options| F[ResNet-18/34]
    B -.->|Options| G[EfficientNet-B0/B1]
    B -.->|Options| H[MobileNetV3]
    B -.->|Options| I[ViT-Small/Base]
    B -.->|Options| J[CDCN]

    style E fill:#2E7D32,color:#fff

Key Architectures

Architecture Type Key Feature Params
CDCN CNN Central Difference Convolution — captures fine-grained patterns 2M
ResNet-18 + binary head CNN Simple, proven baseline 11M
EfficientNet-B0 CNN Good accuracy-efficiency tradeoff 5M
ViT-Small Transformer Self-attention captures global patterns 22M
FLIP-MCL Hybrid Foundation model with multimodal contrastive learning

Auxiliary Supervision

Instead of just binary classification, modern models add auxiliary tasks during training:

Auxiliary Task What It Adds Benefit
Depth map estimation Predict face depth (real face = 3D, spoof = flat) Provides geometric reasoning
Reflection map Predict specular reflection patterns Captures material properties
Binary mask Predict which pixels are real face vs spoof medium Fine-grained spatial understanding
Domain label Predict which dataset the sample came from Encourages domain-invariant features

The Generalization Problem

The biggest challenge in face liveness: models trained on known attacks fail on unseen attacks.

graph LR
    A["Trained on:<br/>OULU-NPU<br/>(Nokia phones, 2 printers)"] -->|Test on| B["OULU-NPU Test<br/>ACER: 1-3%<br/>✅ Great"]
    A -->|Test on| C["CASIA-FASD<br/>ACER: 15-25%<br/>❌ Poor"]
    A -->|Test on| D["Real-world attacks<br/>ACER: ???<br/>❌ Unknown"]

    style B fill:#2E7D32,color:#fff
    style C fill:#e53935,color:#fff
    style D fill:#e53935,color:#fff

Why this happens:

  • Models overfit to specific attack instruments (specific printers, screens)
  • Models overfit to specific capture devices (specific phones, cameras)
  • Lab conditions don't represent real-world diversity
  • New attack types (deepfakes) not in training data

Solutions:

Approach How It Helps
Domain generalization Train to be invariant to source domain
Self-supervised pretraining Learn robust representations without labels
Diverse training data Cover many attack types, devices, environments
Synthetic data augmentation Generate novel attacks for training
Test-time adaptation Adapt to new domains at inference time

See: Domain Generalization for Liveness


Liveness in the eKYC Pipeline

graph TD
    A[Selfie Captured] --> B[Face Detection]
    B --> C[Face Alignment]
    C --> D[Face Quality Check]
    D -->|Quality OK| E["Liveness Model<br/>P(real) vs P(spoof)"]

    E --> F{P(real) > threshold?}
    F -->|Yes - Live| G[Continue to face matching]
    F -->|No - Spoof| H[Reject + alert]

    F -->|Borderline| I[Escalate to V-KYC or retry]

    style G fill:#2E7D32,color:#fff
    style H fill:#e53935,color:#fff

Threshold Setting

Threshold Strategy FAR FRR Use Case
Low threshold (0.3) Higher (more spoofs pass) Lower (fewer real users rejected) Convenience-first
Balanced (0.5) Moderate Moderate Standard eKYC
High threshold (0.7) Lower (fewer spoofs pass) Higher (more real users rejected) High-security
Very high (0.9) Very low Very high Critical systems (border control)

Key Takeaways

Summary

  • Face liveness is the most critical AI component in eKYC — without it, face matching is meaningless
  • Attacks range from simple prints to sophisticated real-time deepfakes
  • Passive liveness (single image) is trending over active (challenge-response) for better UX
  • Domain generalization is the biggest open challenge — models fail on unseen attacks
  • Auxiliary supervision (depth maps, reflection maps) improves robustness beyond binary classification
  • Threshold tuning balances security (low FAR) against usability (low FRR)
  • The field is rapidly evolving — new attacks and defenses emerge continuously