Face Liveness Detection Overview¶
Definition¶
Face liveness detection (also called Presentation Attack Detection or PAD) determines whether the face presented to the camera is a real, live human being — as opposed to a spoofed representation such as a printed photo, a screen replay, a 3D mask, or a deepfake.
It is arguably the most critical AI component in eKYC. If liveness fails, an attacker with a stolen ID can use the victim's photo to pass face matching — making the entire verification meaningless.
Why Liveness is Critical¶
```mermaid
graph TD
    A[Attacker has victim's ID document] --> B[ID passes document verification ✅]
    B --> C{Liveness Check?}
    C -->|No liveness| D[Attacker holds up photo of victim]
    D --> E[Face matches ID photo ✅]
    E --> F[❌ Account opened fraudulently]
    C -->|With liveness| G[Attacker's spoof detected]
    G --> H[✅ Attack blocked]
    style F fill:#e53935,color:#fff
    style H fill:#2E7D32,color:#fff
```
Liveness vs Face Recognition¶
| Aspect | Face Recognition | Face Liveness |
|---|---|---|
| Question | Is this the same person as on the ID? | Is this a live, real person? |
| Input | Face image → embedding → comparison | Face image/video → real/spoof classification |
| Output | Similarity score (0-1) | Binary: live or spoof (with confidence) |
| Threat model | Wrong person | Right person's photo/video/mask/deepfake |
| Complementary | Useless without liveness | Useless without recognition |
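The "useless without" rows can be made concrete: an eKYC face check only passes when both the recognition score and the liveness score clear their thresholds. A minimal sketch (function name and thresholds are illustrative assumptions, not calibrated values):

```python
def ekyc_face_check(similarity: float, p_real: float,
                    sim_thr: float = 0.6, live_thr: float = 0.7) -> bool:
    """Pass only if the face both matches the ID and is judged live.

    `sim_thr` and `live_thr` are hypothetical thresholds for illustration.
    """
    return similarity >= sim_thr and p_real >= live_thr

# A printed photo of the victim: matches the ID perfectly, but fails liveness
assert ekyc_face_check(similarity=0.92, p_real=0.10) is False
# A live impostor: passes liveness, but fails recognition
assert ekyc_face_check(similarity=0.20, p_real=0.95) is False
# The genuine user: both checks pass
assert ekyc_face_check(similarity=0.92, p_real=0.95) is True
```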
Attack Taxonomy¶
```mermaid
graph TD
    A[Spoofing Attacks] --> B[Presentation Attacks<br/>Physical artifacts held to camera]
    A --> C[Injection Attacks<br/>Digital manipulation of camera feed]
    B --> B1[2D Attacks]
    B --> B2[3D Attacks]
    B1 --> B1a[Print attack - photo on paper]
    B1 --> B1b[Screen replay - photo/video on device]
    B1 --> B1c[Cut-out attack - face cut from photo]
    B2 --> B2a[Silicone mask]
    B2 --> B2b[Resin/3D-printed mask]
    B2 --> B2c[Flexible mask - latex, paper-craft]
    C --> C1[Virtual camera - OBS, ManyCam]
    C --> C2[API injection - bypass SDK]
    C --> C3[Emulator - Android emulator]
    C --> C4[Real-time deepfake - face swap]
    C --> C5[App hooking - Frida, Xposed]
    style B fill:#F57F17,color:#000
    style C fill:#e53935,color:#fff
```
Detection Approaches¶
Passive Liveness¶
Analyzes a single image or short video without requiring the user to perform any action:
| Feature Type | What It Detects | Model Approach |
|---|---|---|
| Texture analysis | Moiré patterns, print dots, screen pixels | CNN feature extraction |
| Color analysis | Color distribution differences (real skin vs paper/screen) | Color space analysis (HSV, YCbCr) |
| Frequency analysis | High-frequency artifacts from printing/display | Fourier/wavelet features |
| Depth cues | Flatness of 2D attacks | Monocular depth estimation |
| Reflection | Specular reflections differ between skin and paper/screen | Reflection pattern analysis |
| Edge artifacts | Boundaries between face and spoof medium | Edge gradient analysis |
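One of the cheapest passive cues in the table, frequency analysis, can be sketched in a few lines: screens and prints superimpose pixel grids and print dots, which shift spectral energy toward high frequencies relative to live skin. The function below is a minimal numpy sketch; the 0.25 cutoff is an illustrative assumption, and a production model would feed such features into a trained classifier rather than threshold them directly.

```python
import numpy as np

def high_freq_energy_ratio(gray: np.ndarray, cutoff: float = 0.25) -> float:
    """Fraction of 2D spectral energy beyond a radial frequency cutoff.

    Screen replays and prints tend to add high-frequency artifacts
    (moire, pixel grids, print dots), raising this ratio relative to
    live skin. `cutoff` is a fraction of the Nyquist radius and is an
    illustrative value, not a tuned one.
    """
    spectrum = np.abs(np.fft.fftshift(np.fft.fft2(gray.astype(float)))) ** 2
    h, w = spectrum.shape
    yy, xx = np.mgrid[0:h, 0:w]
    # Normalised radial distance from the spectrum centre (DC term)
    r = np.hypot((yy - h / 2) / (h / 2), (xx - w / 2) / (w / 2))
    return float(spectrum[r > cutoff].sum() / spectrum.sum())

# A smooth gradient (skin-like, low frequency) vs. the same gradient with
# a superimposed pixel-grid pattern (screen-like, high frequency)
smooth = np.tile(np.linspace(0, 255, 64), (64, 1))
grid = smooth + 60 * (np.indices((64, 64)).sum(axis=0) % 2)
assert high_freq_energy_ratio(grid) > high_freq_energy_ratio(smooth)
```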
Active Liveness¶
Requires the user to perform specific actions (challenges):
| Challenge | What It Proves | User Action |
|---|---|---|
| Blink | Face can blink naturally | Close and open eyes |
| Head turn | Face has 3D structure, responds to instructions | Turn head left/right |
| Smile | Face can change expression | Smile naturally |
| Random gaze | Eyes track a moving target | Follow on-screen dot |
| Illumination response | Skin reflects light differently than paper/screen | Screen flashes colors, observe face response |
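The blink challenge is typically scored with the Eye Aspect Ratio (EAR) over six eye landmarks: the ratio drops toward zero when the eye closes. A minimal sketch, assuming the common 6-point eye landmark convention; the 0.2 blink threshold is an illustrative assumption, not a calibrated value.

```python
from math import dist

def eye_aspect_ratio(eye: list[tuple[float, float]]) -> float:
    """EAR = (|p2-p6| + |p3-p5|) / (2 * |p1-p4|) over landmarks p1..p6.

    Roughly constant while the eye is open, dropping toward 0 during a
    blink. Landmark order follows the common 6-point eye convention.
    """
    p1, p2, p3, p4, p5, p6 = eye
    return (dist(p2, p6) + dist(p3, p5)) / (2 * dist(p1, p4))

def count_blinks(ear_series: list[float], threshold: float = 0.2) -> int:
    """Count closed-then-reopened transitions in a per-frame EAR series."""
    blinks, closed = 0, False
    for ear in ear_series:
        if ear < threshold:
            closed = True
        elif closed:        # eye re-opened after being closed
            blinks += 1
            closed = False
    return blinks

open_eye = [(0, 0), (1, 1), (2, 1), (3, 0), (2, -1), (1, -1)]
assert eye_aspect_ratio(open_eye) > 0.2
assert count_blinks([0.30, 0.28, 0.08, 0.05, 0.29, 0.31]) == 1
```

A static photo produces a flat EAR series and zero blinks, which is exactly what this challenge exploits.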
Comparison¶
| Aspect | Passive | Active |
|---|---|---|
| User experience | Seamless — just look at camera | Requires following instructions |
| Speed | < 1 second | 3-10 seconds |
| Spoof resistance | Good (with strong models) | Higher (harder to replay) |
| Deepfake resistance | Moderate | Higher (dynamic actions) |
| Accessibility | Better (no action needed) | Challenging for some disabilities |
| Failure rate | Lower | Higher (users confused by challenges) |
| Industry trend | Growing preference | Declining (UX concerns) |
Model Architecture Overview¶
Typical Pipeline¶
```mermaid
graph TD
    A[Face Image 224×224] --> B[Backbone]
    B --> C[Feature Maps]
    C --> D[Classification Head]
    D --> E["Output: P(real), P(spoof)"]
    B -.->|Options| F[ResNet-18/34]
    B -.->|Options| G[EfficientNet-B0/B1]
    B -.->|Options| H[MobileNetV3]
    B -.->|Options| I[ViT-Small/Base]
    B -.->|Options| J[CDCN]
    style E fill:#2E7D32,color:#fff
```
Key Architectures¶
| Architecture | Type | Key Feature | Params |
|---|---|---|---|
| CDCN | CNN | Central Difference Convolution — captures fine-grained patterns | 2M |
| ResNet-18 + binary head | CNN | Simple, proven baseline | 11M |
| EfficientNet-B0 | CNN | Good accuracy-efficiency tradeoff | 5M |
| ViT-Small | Transformer | Self-attention captures global patterns | 22M |
| FLIP-MCL | Hybrid | Foundation model with multimodal contrastive learning | — |
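CDCN's Central Difference Convolution is simple enough to sketch directly: it blends a vanilla convolution with a central-difference term, `y = conv(x, w) - theta * x_center * sum(w)`, which emphasizes fine-grained local gradients (texture cues) that plain convolutions can wash out. A minimal single-channel numpy sketch, no padding; theta = 0.7 is the value commonly quoted for CDCN.

```python
import numpy as np

def central_difference_conv2d(x: np.ndarray, w: np.ndarray,
                              theta: float = 0.7) -> np.ndarray:
    """Central Difference Convolution (CDC) on one channel, 'valid' mode.

    y(p0) = sum_n w(pn) * x(p0 + pn)  -  theta * x(p0) * sum_n w(pn)
    With theta = 0 this reduces to a plain (cross-correlation style)
    convolution. Minimal educational sketch, not an optimized layer.
    """
    kh, kw = w.shape
    out_h, out_w = x.shape[0] - kh + 1, x.shape[1] - kw + 1
    y = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            patch = x[i:i + kh, j:j + kw]
            center = x[i + kh // 2, j + kw // 2]
            y[i, j] = (patch * w).sum() - theta * center * w.sum()
    return y

x = np.arange(25, dtype=float).reshape(5, 5)
w = np.ones((3, 3)) / 9.0
# Adding back the central-difference term recovers the plain convolution
assert np.allclose(central_difference_conv2d(x, w, theta=0.0),
                   central_difference_conv2d(x, w, theta=0.7)
                   + 0.7 * x[1:4, 1:4] * w.sum())
```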
Auxiliary Supervision¶
Instead of just binary classification, modern models add auxiliary tasks during training:
| Auxiliary Task | What It Adds | Benefit |
|---|---|---|
| Depth map estimation | Predict face depth (real face = 3D, spoof = flat) | Provides geometric reasoning |
| Reflection map | Predict specular reflection patterns | Captures material properties |
| Binary mask | Predict which pixels are real face vs spoof medium | Fine-grained spatial understanding |
| Domain label | Predict which dataset the sample came from | Encourages domain-invariant features |
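In training, these auxiliary tasks usually appear as extra weighted terms in the loss. A minimal sketch of the depth-map variant (the lambda weight and the 8×8 map size are illustrative assumptions): a real face is supervised with a non-flat depth map, while a spoof's target depth is all zeros.

```python
import numpy as np

def liveness_multitask_loss(p_real: float, label: int,
                            pred_depth: np.ndarray, gt_depth: np.ndarray,
                            lambda_depth: float = 0.5) -> float:
    """Binary cross-entropy plus a weighted depth-map auxiliary term.

    label: 1 = live, 0 = spoof. For spoofs the ground-truth depth map is
    all zeros (the attack is flat). `lambda_depth` is an illustrative
    weight, not a tuned value.
    """
    eps = 1e-7
    bce = -(label * np.log(p_real + eps)
            + (1 - label) * np.log(1 - p_real + eps))
    depth_mse = float(np.mean((pred_depth - gt_depth) ** 2))
    return float(bce + lambda_depth * depth_mse)

# Spoof sample: a flat depth prediction pays no auxiliary penalty,
# a non-flat one is pushed back toward zero depth
flat = np.zeros((8, 8))
loss_flat = liveness_multitask_loss(0.1, 0, flat, flat)
loss_bumpy = liveness_multitask_loss(0.1, 0, np.full((8, 8), 0.5), flat)
assert loss_bumpy > loss_flat
```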
The Generalization Problem¶
The biggest challenge in face liveness: models trained on known attacks fail on unseen attacks.
```mermaid
graph LR
    A["Trained on:<br/>OULU-NPU<br/>(Nokia phones, 2 printers)"] -->|Test on| B["OULU-NPU Test<br/>ACER: 1-3%<br/>✅ Great"]
    A -->|Test on| C["CASIA-FASD<br/>ACER: 15-25%<br/>❌ Poor"]
    A -->|Test on| D["Real-world attacks<br/>ACER: ???<br/>❌ Unknown"]
    style B fill:#2E7D32,color:#fff
    style C fill:#e53935,color:#fff
    style D fill:#e53935,color:#fff
```
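ACER, the metric used in these cross-dataset comparisons, averages the two error directions: the rate of spoofs accepted (APCER) and the rate of live users rejected (BPCER). A minimal sketch, assuming scores are P(real) and labels use 1 = live, 0 = spoof:

```python
def acer(scores: list[float], labels: list[int],
         threshold: float = 0.5) -> float:
    """ACER = (APCER + BPCER) / 2.

    APCER: fraction of spoof samples accepted as live.
    BPCER: fraction of live samples rejected as spoof.
    """
    spoofs = [s for s, y in zip(scores, labels) if y == 0]
    lives = [s for s, y in zip(scores, labels) if y == 1]
    apcer = sum(s >= threshold for s in spoofs) / len(spoofs)
    bpcer = sum(s < threshold for s in lives) / len(lives)
    return (apcer + bpcer) / 2

# 1 of 4 spoofs passes (APCER 25%), 0 of 4 lives rejected (BPCER 0%)
scores = [0.9, 0.8, 0.95, 0.7, 0.1, 0.6, 0.2, 0.3]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
assert acer(scores, labels) == 0.125
```

Cross-dataset evaluation simply computes this on a dataset the model never trained on, which is where the numbers collapse.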
Why this happens:
- Models overfit to specific attack instruments (specific printers, screens)
- Models overfit to specific capture devices (specific phones, cameras)
- Lab conditions don't represent real-world diversity
- New attack types (deepfakes) not in training data
Solutions:
| Approach | How It Helps |
|---|---|
| Domain generalization | Train to be invariant to source domain |
| Self-supervised pretraining | Learn robust representations without labels |
| Diverse training data | Cover many attack types, devices, environments |
| Synthetic data augmentation | Generate novel attacks for training |
| Test-time adaptation | Adapt to new domains at inference time |
See: Domain Generalization for Liveness
Liveness in the eKYC Pipeline¶
```mermaid
graph TD
    A[Selfie Captured] --> B[Face Detection]
    B --> C[Face Alignment]
    C --> D[Face Quality Check]
    D -->|Quality OK| E["Liveness Model<br/>P(real) vs P(spoof)"]
    E --> F{"P(real) > threshold?"}
    F -->|Yes - Live| G[Continue to face matching]
    F -->|No - Spoof| H[Reject + alert]
    F -->|Borderline| I[Escalate to V-KYC or retry]
    style G fill:#2E7D32,color:#fff
    style H fill:#e53935,color:#fff
```
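The three-way branch at the end of the pipeline is a pair of thresholds around a borderline band. A minimal sketch; the 0.7/0.4 thresholds are illustrative assumptions, not production values:

```python
def liveness_decision(p_real: float,
                      accept: float = 0.7, reject: float = 0.4) -> str:
    """Three-way routing on the liveness score P(real).

    Scores at or above `accept` continue to face matching, scores below
    `reject` are rejected with an alert, and the borderline band in
    between escalates to V-KYC or a retry. Thresholds are hypothetical.
    """
    if p_real >= accept:
        return "continue"      # live: proceed to face matching
    if p_real < reject:
        return "reject"        # spoof: reject and alert
    return "escalate"          # borderline: V-KYC or retry

assert liveness_decision(0.92) == "continue"
assert liveness_decision(0.15) == "reject"
assert liveness_decision(0.55) == "escalate"
```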
Threshold Setting¶
| Threshold Strategy | FAR | FRR | Use Case |
|---|---|---|---|
| Low threshold (0.3) | Higher (more spoofs pass) | Lower (fewer real users rejected) | Convenience-first |
| Balanced (0.5) | Moderate | Moderate | Standard eKYC |
| High threshold (0.7) | Lower (fewer spoofs pass) | Higher (more real users rejected) | High-security |
| Very high (0.9) | Very low | Very high | Critical systems (border control) |
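The FAR/FRR trade-off in the table can be reproduced by sweeping a threshold over scored samples: raising the threshold pushes FAR down and FRR up. A toy sketch with made-up scores (labels use 1 = live, 0 = spoof):

```python
def far_frr(scores: list[float], labels: list[int],
            threshold: float) -> tuple[float, float]:
    """FAR and FRR at one threshold on P(real)."""
    spoofs = [s for s, y in zip(scores, labels) if y == 0]
    lives = [s for s, y in zip(scores, labels) if y == 1]
    far = sum(s >= threshold for s in spoofs) / len(spoofs)  # spoofs accepted
    frr = sum(s < threshold for s in lives) / len(lives)     # lives rejected
    return far, frr

# Toy scores: each step up in threshold trades FAR down for FRR up
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45, 0.35, 0.25]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
for t in (0.3, 0.5, 0.7, 0.9):
    far, frr = far_frr(scores, labels, t)
    print(f"threshold={t}: FAR={far:.2f} FRR={frr:.2f}")
```

In practice the operating point is chosen from curves like this on a held-out set, matched to the risk profile in the table above.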
Key Takeaways¶
Summary
- Face liveness is the most critical AI component in eKYC — without it, face matching is meaningless
- Attacks range from simple prints to sophisticated real-time deepfakes
- Passive liveness (single image) is displacing active (challenge-response) liveness because of its better UX
- Domain generalization is the biggest open challenge — models fail on unseen attacks
- Auxiliary supervision (depth maps, reflection maps) improves robustness beyond binary classification
- Threshold tuning balances security (low FAR) against usability (low FRR)
- The field is rapidly evolving — new attacks and defenses emerge continuously