2.2 Passive Liveness Detection¶

Overview¶

Passive liveness detection analyzes a single image or short video clip without requiring any explicit user interaction. The system extracts features from the captured face that distinguish live presentations from spoofed ones — all transparently, in the background.

The User's Experience

From the user's perspective, passive liveness is invisible. They simply take a selfie or look at the camera, and the system makes its determination. This results in a seamless, frictionless experience with the highest completion rates.

graph TD
    A["📸 User takes<br>selfie / looks<br>at camera"] --> B["Face Detection<br>& Quality Check"]
    B --> C["Feature<br>Extraction"]
    C --> D["Multi-Signal<br>Analysis"]
    D --> E["Liveness<br>Score"]
    E --> F{"Decision"}
    F -->|"Live"| G["✅ Continue<br>to face matching"]
    F -->|"Spoof"| H["❌ Reject"]

    style G fill:#27ae60,color:#fff
    style H fill:#e74c3c,color:#fff

Core Analysis Methods¶

1. Texture Analysis (Spatial Domain)¶

The most fundamental passive liveness signal. Deep learning models analyze micro-texture patterns at the pixel level.

What the model looks for:

Feature	Live Skin	Printed Photo	Screen Display	3D Mask
Pore structure	Natural, irregular pore distribution	Halftone dot patterns replace pores	Pixel grid visible under analysis	Smooth or artificially textured
Skin micro-texture	Rich, multi-scale texture with natural variation	Ink dot patterns, paper fiber texture	RGB sub-pixel patterns, aliasing	Silicone/latex grain, paint texture
Specular highlights	Single, consistent environmental reflections on oily regions	Matte paper: absent. Glossy: paper-like reflectance	Glass/screen reflections overlaid on face	Material-dependent, often too uniform
Color distribution	Wide, natural skin color gamut with subsurface warmth	Limited printer gamut, shifted color balance	Backlit colors, potential color banding	Paint/pigment gamut, often slightly off
Edge characteristics	Smooth 3D-to-background transition with natural depth-of-field	Sharp 2D cutout edges, paper boundary visible	Screen bezel boundary, Moiré at edges	Mask-to-skin boundary visible

How texture models work internally:

graph TD
    A["Input Face Image<br>224×224 or 256×256"] --> B["Backbone CNN<br>(MobileNetV3 / EfficientNet)"]
    B --> C["Multi-Scale<br>Feature Maps"]
    C --> D["Low-level features:<br>Edge patterns, noise<br>(Layers 1-3)"]
    C --> E["Mid-level features:<br>Texture patterns, pores<br>(Layers 4-8)"]
    C --> F["High-level features:<br>Semantic understanding<br>(Layers 9+)"]
    D --> G["Feature<br>Aggregation"]
    E --> G
    F --> G
    G --> H["Classification Head"]
    H --> I["Liveness Score<br>0.0 → 1.0"]

2. Depth Estimation¶

Neural networks estimate the 3D geometry of the face from a single 2D image. Live faces produce consistent 3D depth maps; flat attacks produce anomalous depth.

Depth map comparison:

Scenario	Expected Depth Map	Key Characteristics
Live face	Nose protrudes (closest to camera), eyes recessed in orbits, cheeks curve away, chin projects forward	Smooth, anatomically consistent depth gradients; 40-80mm depth range across face
Printed photo	Flat plane with minor curvature from paper bending	Near-uniform depth; no anatomical depth structure; < 5mm depth variation
Screen display	Flat plane matching screen surface	Perfectly uniform depth; possible slight concavity from screen curvature
3D rigid mask	Approximate facial geometry but incorrect in fine details	Exaggerated or incorrect nasal bridge, missing orbital depth, uniform surface without fine detail
3D flexible mask	Close to real but with detectable differences	Subtle geometric deviations, especially around eyes, nostrils, and mouth

Training approach for depth-based liveness:

The model is trained with auxiliary depth supervision — alongside the liveness label, the model learns to predict a depth map:

For live faces: Ground truth depth map generated from a 3D face reconstruction model (e.g., 3DDFA, DECA)
For spoof faces: Ground truth depth is a zero/flat map
The model learns that "being live" is associated with "having valid 3D structure"

3. Frequency Domain Analysis¶

Fourier Transform and Wavelet analysis reveal frequency signatures characteristic of different media.

graph TD
    A["Face Image"] --> B["2D FFT<br>(Fourier Transform)"]
    B --> C["Power Spectrum"]
    C --> D["Frequency<br>Signature Analysis"]

    D --> E["High-frequency peaks<br>at regular intervals?<br>→ Print halftone"]
    D --> F["RGB sub-pixel<br>frequency peaks?<br>→ Screen display"]
    D --> G["Natural 1/f<br>noise falloff?<br>→ Live face"]
    D --> H["GAN upsampling<br>artifacts?<br>→ Deepfake"]

Frequency signatures by attack type:

Attack Type	Frequency Domain Signature
Live face	Natural 1/f noise spectrum (power decreases inversely with frequency); sensor noise pattern consistent with camera model
Laser/inkjet print	Periodic peaks corresponding to halftone screen frequency (typically 150-300 LPI); paper texture frequency
LCD screen	Peaks at pixel pitch frequency (varies with screen PPI); RGB sub-pixel pattern frequencies; backlight frequency
OLED screen	Sub-pixel pattern (PenTile or RGB stripe); different from LCD due to pixel layout
GAN-generated	Upsampling artifacts at specific frequencies; checkerboard patterns from transposed convolutions; GAN fingerprint
Deepfake (face swap)	Blending boundary frequencies; inconsistent noise patterns between swapped face and surrounding area

4. Reflection & Specularity Analysis¶

Analysis of how light interacts with the presentation surface.

Key signals:

Corneal reflections: Live eyes show clear reflections of the environment (lights, windows, screens). The reflection should be consistent with the ambient scene and identical in both eyes. Screens show screen-within-screen reflections.
Skin specularity: Oily skin areas (T-zone: forehead, nose, chin) show specular highlights that move consistently with head motion and lighting direction.
Double reflection: Screen-based attacks often show two reflection sources — one from the original scene captured in the photo/video, and one from the attack screen surface.
Polarization cues: Though not capturable with standard cameras, the principle applies — reflections from glass (screens) are partially polarized differently than reflections from skin.

5. Image Quality & Artifact Detection¶

Detection of artifacts that indicate a non-live source.

Artifact	Indicates	Detection Method
Moiré patterns	Screen capture of another screen	Frequency analysis for periodic interference patterns
JPEG compression artifacts	Re-compressed image (not fresh capture)	Block boundary analysis, quantization table detection
Color banding	Limited bit depth or color gamut compression	Gradient analysis in smooth regions (cheeks, forehead)
Pixel repetition	Digital zoom or upscaling	Auto-correlation analysis for repeating pixel patterns
Edge ringing	Sharpening artifacts from processing	High-pass filter analysis near strong edges
Noise inconsistency	Composited image (different noise levels in face vs background)	Local noise estimation across image regions
Lens distortion absence	Non-camera source (rendered or stitched image)	Radial distortion model fitting
EXIF metadata anomalies	Modified or fabricated image	Metadata consistency checks (though unreliable — easily faked)

6. Remote Photoplethysmography (rPPG)¶

Even from a "single image" passive approach, short video clips (2-5 seconds) enable rPPG analysis — one of the strongest passive liveness signals.

graph TD
    A["Short video clip<br>(2-5 seconds, 15-30 FPS)"] --> B["Face region<br>of interest (ROI)<br>extraction"]
    B --> C["Per-frame mean<br>color values<br>(R, G, B channels)"]
    C --> D["Temporal signal<br>extraction"]
    D --> E["Bandpass filter<br>(0.7 - 4.0 Hz)<br>= 42-240 BPM"]
    E --> F["FFT / Peak<br>detection"]
    F --> G{"Periodic signal<br>detected at<br>heart rate frequency?"}
    G -->|"Yes: Clear periodic signal"| H["✅ Live<br>(heart beating)"]
    G -->|"No: Flat/noisy signal"| I["❌ Spoof<br>(no blood flow)"]

    style H fill:#27ae60,color:#fff
    style I fill:#e74c3c,color:#fff

Why rPPG is powerful:

Blood flow causes micro-color changes in skin (imperceptible to the human eye but detectable by cameras) synchronized with the heartbeat
No known attack can synthetically reproduce physiologically accurate rPPG signals in real-time
Works as a strong supplementary signal even when other passive methods are uncertain
Limitation: Requires 2-5 second video, not a single frame; affected by motion, compression, and very dark skin tones

Model Architecture for Passive Liveness¶

Recommended Architecture: Multi-Task Learning¶

graph TD
    A["Input: Face Image<br>256×256×3"] --> B["Shared Backbone<br>(EfficientNet-B0 /<br>MobileNetV3-Large)"]
    B --> C["Feature Maps<br>Multi-scale"]

    C --> D["Head 1:<br>Binary Liveness<br>(Live / Spoof)"]
    C --> E["Head 2:<br>Depth Map<br>Estimation"]
    C --> F["Head 3:<br>Attack Type<br>Classification"]
    C --> G["Head 4:<br>Domain Classifier<br>(for DG training)"]

    D --> H["Liveness<br>Score"]
    E --> I["Depth Map<br>32×32"]
    F --> J["Attack Type<br>Probabilities"]
    G --> K["Domain<br>Prediction"]

    H --> L["Score Fusion<br>& Decision"]
    I --> L
    J --> L

Multi-task benefits:

Depth map supervision provides better gradient signal than binary classification alone
Attack type classification enables interpretable decisions (know what type of attack was detected)
Domain classifier (with gradient reversal) enables domain-invariant features for better generalization

Advantages & Disadvantages¶

Aspect	Rating	Details
User experience	⭐⭐⭐⭐⭐	Zero friction — user just takes a selfie; highest completion rates
Accessibility	⭐⭐⭐⭐⭐	No motor/speech/cognitive requirements; works for all users
Processing speed	⭐⭐⭐⭐⭐	100-500ms for single-frame; 1-3s with short video
Drop-off rate	⭐⭐⭐⭐⭐	2-5% (mostly due to camera quality, not liveness UX)
Security (2D attacks)	⭐⭐⭐⭐	Strong against prints and basic screen replays
Security (3D masks)	⭐⭐⭐	Moderate — texture analysis helps but geometry can fool depth models
Security (deepfakes)	⭐⭐⭐	Moderate — depends on deepfake quality and detector sophistication
Security (injection attacks)	⭐⭐	Weak if only analyzing image content without device/session validation
Regulatory acceptance	⭐⭐⭐⭐	Growing acceptance; some regulators still prefer active challenge evidence
Explainability	⭐⭐⭐	Harder to explain "why rejected" to regulators compared to active methods

When to Use Passive Liveness¶

Ideal For

High-volume, low-friction onboarding where drop-off reduction is critical
First-pass screening before active challenge escalation
Inclusive deployments where accessibility is a hard requirement
Markets where users are less tech-savvy and active challenges cause confusion
Transaction authentication where speed is essential

Not Sufficient Alone For

High-value banking transactions (combine with active or multi-modal)
Jurisdictions where regulators explicitly require active challenge evidence
Deployments where sophisticated deepfake attacks are a primary threat
Scenarios where virtual camera injection is a known attack vector

Next: Hybrid & Adaptive Liveness →