2.1 Active Liveness Detection
Overview
Active liveness detection requires the user to perform specific actions in response to system-generated prompts. The system verifies both that the actions are performed correctly and that they exhibit characteristics consistent with a live human being.
sequenceDiagram
participant U as User
participant A as App (Client)
participant S as Server
A->>S: Request liveness session
S->>A: Session ID + Random challenge sequence
A->>U: Display challenge: "Turn head left"
U->>A: Performs action (camera captures)
A->>A: Local face tracking & quality check
A->>U: Display challenge: "Smile"
U->>A: Performs action
A->>A: Local verification
A->>S: Encrypted frames + metadata
S->>S: Active challenge verification
S->>S: Passive liveness analysis (parallel)
S->>S: Deepfake detection (parallel)
S->>A: Liveness result
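The three server-side analyses marked "parallel" in the diagram could be dispatched along the following lines. This is a minimal sketch, not the system's actual implementation: `run_parallel_analyses` and the three stub analyzers are hypothetical names, and the fixed scores they return are placeholders for real models.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

Frames = List[bytes]  # decrypted frames; a stand-in type for this sketch


def run_parallel_analyses(
    frames: Frames, analyzers: Dict[str, Callable[[Frames], float]]
) -> Dict[str, float]:
    """Run every analyzer on the same frames concurrently and collect the scores."""
    with ThreadPoolExecutor(max_workers=max(1, len(analyzers))) as pool:
        futures = {name: pool.submit(fn, frames) for name, fn in analyzers.items()}
        # result() re-raises any analyzer exception, so failures are not silently dropped.
        return {name: future.result() for name, future in futures.items()}


# Hypothetical stand-ins for the three server-side checks in the diagram.
def verify_active_challenges(frames: Frames) -> float:
    return 0.91


def passive_liveness(frames: Frames) -> float:
    return 0.88


def deepfake_detection(frames: Frames) -> float:
    return 0.95


scores = run_parallel_analyses(
    [b"frame-1", b"frame-2"],
    {
        "active": verify_active_challenges,
        "passive": passive_liveness,
        "deepfake": deepfake_detection,
    },
)
```

How the three scores are combined into the final liveness result is deliberately left out, since it depends on the deployment's risk policy.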
Challenge Types in Detail
1. Head Movement Challenges
The user is instructed to turn their head in a specific direction (left, right, up, down) or perform a sequence of movements.
How It Works:
graph TD
A["Challenge Issued:<br>'Turn head left'"] --> B["Face Landmark<br>Tracking (478 points)"]
B --> C["3D Head Pose<br>Estimation"]
C --> D{"Yaw angle<br>changed > 15°?"}
D -->|Yes| E["Analyze Motion<br>Naturalness"]
D -->|No| F["Timeout / Retry"]
E --> G{"Natural motion<br>characteristics?"}
G -->|Yes| H["✅ Challenge Passed"]
G -->|No| I["❌ Suspicious Motion"]
style H fill:#27ae60,color:#fff
style I fill:#e74c3c,color:#fff
What the system analyzes:
| Signal | Live Person | Photo/Screen Attack | 3D Mask |
|---|---|---|---|
| Yaw/Pitch/Roll change | Smooth, continuous | None or rigid (whole image moves) | Can simulate but lacks skin deformation |
| Motion blur | Natural blur at edges during rotation | Absent or uniform blur | Minimal, unnatural |
| Parallax effects | Nose tip moves faster than ears, depth-consistent | No parallax — flat image | Approximate parallax but incorrect for flexible features |
| Skin deformation | Neck skin folds, cheek compression | None | Absent or artificial |
| Temporal dynamics | Acceleration/deceleration curve matches human biomechanics | Instantaneous or mechanical | Approximate but measurable differences |
| Background consistency | Background perspective shifts with head movement | Background moves with face (screen) or stays static (photo) | Background may be visible around mask edges |
Implementation parameters:
| Parameter | Recommended Value | Rationale |
|---|---|---|
| Minimum yaw change | 15-25° | Below 15° is too easy to fake; above 25° causes user discomfort |
| Maximum allowed time | 3-5 seconds per direction | Prevents coached/assisted attacks while accommodating normal users |
| Minimum angular velocity | 5°/second | Ensures motion is intentional, not environmental vibration |
| Frame rate for tracking | 15-30 FPS | Below 15 FPS loses motion detail; above 30 FPS adds little value on most cameras |
| Landmark model | 478-point mesh (MediaPipe) or 68-point (dlib) | The denser 478-point mesh generally gives more accurate head pose estimates |
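The parameters above can be turned into a simple pass/fail check on a yaw time series. The sketch below is illustrative only: `HeadTurnParams` and `head_turn_passed` are hypothetical names, the defaults take the lower bound of the recommended yaw range and the upper bound of the time range, and the sign convention (positive yaw means the user's left) is an assumption that depends on the pose estimator. It does not cover the motion-naturalness analysis.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class HeadTurnParams:
    min_yaw_change_deg: float = 15.0          # lower bound of the 15-25 degree range
    max_duration_s: float = 5.0               # upper bound of the 3-5 second window
    min_angular_velocity_deg_s: float = 5.0   # filters out drift and vibration


def head_turn_passed(
    samples: List[Tuple[float, float]],
    direction: str,
    params: HeadTurnParams = HeadTurnParams(),
) -> bool:
    """samples: (timestamp_s, yaw_deg) pairs in time order.

    Assumed sign convention: positive yaw is a turn to the user's left.
    """
    if len(samples) < 2:
        return False
    sign = 1.0 if direction == "left" else -1.0
    t0, yaw0 = samples[0]
    for t, yaw in samples[1:]:
        elapsed = t - t0
        if elapsed > params.max_duration_s:
            break  # timed out before reaching the required angle
        change = sign * (yaw - yaw0)
        if elapsed > 0 and change >= params.min_yaw_change_deg:
            # Mean angular velocity up to the moment the threshold was reached.
            return change / elapsed >= params.min_angular_velocity_deg_s
    return False
```

For example, a series that reaches 21° of yaw after 1.5 seconds passes a "left" challenge and fails a "right" one, while a 16° turn spread over 4 seconds fails the velocity check.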
2. Facial Expression Challenges
The user is prompted to perform specific facial expressions — smile, raise eyebrows, open mouth, close eyes.
How It Works:
The system uses the Facial Action Coding System (FACS) to verify genuine muscle movements. Each expression corresponds to specific Action Units (AUs):
| Expression | Primary Action Units | What System Checks |
|---|---|---|
| Smile | AU6 (cheek raise) + AU12 (lip corner pull) | Both AUs activate together (Duchenne smile indicators); cheek muscles lift; crow's feet appear around eyes |
| Blink | AU45 (blink) | Blink duration (150-400 ms is typical), lid closing and reopening trajectory, simultaneous bilateral closure |
| Raised Eyebrows | AU1 (inner brow raise) + AU2 (outer brow raise) | Forehead wrinkles appear; skin texture changes dynamically; natural asymmetry |
| Open Mouth | AU25 (lips part) + AU26 (jaw drop) | Jaw hinge motion, teeth visibility progression, lip deformation |
| Puffed Cheeks | AU34 (puff) | Bilateral cheek expansion, chin muscle tension, natural asymmetry |
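The expression-to-AU mapping in the table lends itself to a lookup plus a co-activation check. The sketch below assumes an upstream AU estimator (not shown, hypothetical) that reports per-AU intensities normalized to the range 0 to 1. The 0.5 threshold is an illustrative value, not a recommendation from this document.

```python
from typing import Dict, FrozenSet

# Primary Action Units per expression, taken from the table above.
REQUIRED_AUS: Dict[str, FrozenSet[str]] = {
    "smile": frozenset({"AU6", "AU12"}),
    "blink": frozenset({"AU45"}),
    "raised_eyebrows": frozenset({"AU1", "AU2"}),
    "open_mouth": frozenset({"AU25", "AU26"}),
    "puffed_cheeks": frozenset({"AU34"}),
}


def expression_detected(
    expression: str, au_intensities: Dict[str, float], threshold: float = 0.5
) -> bool:
    """True only if every primary AU for the expression is active in the same frame.

    au_intensities maps AU names to intensities in [0, 1]; missing AUs count as 0.
    """
    required = REQUIRED_AUS[expression]
    return all(au_intensities.get(au, 0.0) >= threshold for au in required)
```

Requiring all primary AUs together is what separates a Duchenne-style smile (AU6 plus AU12) from lip movement alone.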
Key detection signals:
- Muscle activation authenticity: Real expressions involve coordinated muscle groups. A photo bent to simulate a smile doesn't produce the eye crinkle (AU6) or cheek lift.
- Temporal dynamics: Real expressions have natural onset (200-500ms), peak, and offset phases. Deepfakes often show unnatural timing.
- Micro-expression leakage: Brief, involuntary micro-expressions often occur before and after the requested expression in genuine presentations.
- Skin texture change: Wrinkles appear and disappear dynamically with expressions; paper and screens can't replicate this.
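The onset-timing signal can be sketched as a measurement on an expression-intensity time series. The function names and the 10%/90% crossing levels below are assumptions made for illustration; only the 200-500 ms onset window comes from the text above.

```python
from typing import List, Optional, Tuple


def onset_duration_ms(
    series: List[Tuple[float, float]], low: float = 0.1, high: float = 0.9
) -> Optional[float]:
    """series: (timestamp_ms, intensity in [0, 1]) pairs in time order.

    Returns the time from the first sample at or above `low` to the first
    sample at or above `high`, or None if the expression never reaches `high`.
    """
    start: Optional[float] = None
    for t, value in series:
        if start is None and value >= low:
            start = t
        if start is not None and value >= high:
            return t - start
    return None


def onset_is_plausible(
    series: List[Tuple[float, float]], min_ms: float = 200.0, max_ms: float = 500.0
) -> bool:
    """Check the measured onset against the 200-500 ms window for natural expressions."""
    duration = onset_duration_ms(series)
    return duration is not None and min_ms <= duration <= max_ms
```

An intensity that jumps from 0 to 1 between two consecutive frames yields an onset of 0 ms and is rejected, which is the kind of unnatural timing the list describes.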
Accessibility Concern
Expression-based challenges can be difficult or impossible for users with facial paralysis (Bell's palsy), Parkinson's disease, post-stroke conditions, or Moebius syndrome. Always provide alternative challenge types or passive liveness fallback for accessibility compliance.
3. Gaze Tracking Challenges
The user follows a moving target on the screen with their eyes.
How It Works:
graph TD
A["Random target<br>appears on screen"] --> B["User tracks<br>with eyes"]
B --> C["Iris position<br>extracted per frame"]
C --> D["Gaze trajectory<br>mapped"]
D --> E{"Trajectory matches<br>target path?"}
E -->|Match| F["Analyze saccadic<br>patterns"]
E -->|No match| I
F --> G{"Natural eye<br>movement?"}
G -->|Yes| H["✅ Pass"]
G -->|No| I["❌ Fail"]
style H fill:#27ae60,color:#fff
style I fill:#e74c3c,color:#fff
What makes gaze tracking effective:
| Signal | Why It's Hard to Fake |
|---|---|
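| Saccadic movements | Rapid, ballistic eye jumps between fixation points, with peak velocities of roughly 300-500°/s. A photo shows none, and a replayed video cannot time them to a randomized target |
| Smooth pursuit | Eyes track a moving target with a slight lag (50-100ms). A pre-recorded video cannot follow a target path that is randomized per session |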
| Vergence | Eyes converge/diverge based on target distance. This requires real binocular vision |
| Pupillary response | Pupils dilate/constrict with light changes. Moving from bright to dark areas of screen produces measurable pupil change |
| Vestibulo-ocular reflex | Eyes counter-rotate when head moves to stabilize gaze. Photos can't exhibit this |
| Microsaccades | Tiny involuntary eye movements during fixation (typically under 1°). Present in all live eyes and absent in photos; a replayed video may not preserve them at typical capture resolution |
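The smooth-pursuit lag can be estimated by finding the frame shift that best aligns the gaze trajectory with the target path. The following is a simplified one-dimensional sketch with hypothetical function names. A real system would work in two dimensions and handle noise and blinks.

```python
from typing import Sequence


def best_lag_frames(target: Sequence[float], gaze: Sequence[float], max_lag: int) -> int:
    """Return the shift k in 0..max_lag that minimizes the mean squared error
    between target[i] and gaze[i + k], i.e. gaze trailing the target by k frames."""
    if max_lag < 0 or max_lag >= len(gaze) or not target:
        raise ValueError("max_lag must be in 0..len(gaze)-1 and target must be non-empty")

    def mse(k: int) -> float:
        pairs = list(zip(target, gaze[k:]))
        return sum((a - b) ** 2 for a, b in pairs) / len(pairs)

    return min(range(max_lag + 1), key=mse)


def lag_ms(lag_frames: int, fps: float) -> float:
    """Convert a lag in frames to milliseconds."""
    return lag_frames * 1000.0 / fps


# Target moves 10 px per frame; the simulated gaze trails it by two frames.
target_x = [i * 10.0 for i in range(20)]
gaze_x = [0.0, 0.0] + target_x[:-2]
lag = best_lag_frames(target_x, gaze_x, max_lag=5)
```

At 30 FPS a two-frame lag is about 67 ms, inside the 50-100 ms range given in the table. A trajectory whose best alignment error stays large at every shift suggests the eyes were not following the target at all.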
4. Color Sequence Illumination
The device screen flashes a random sequence of colors while the front camera captures the user's face.
How It Works:
graph TD
A["Generate random<br>color sequence:<br>R→G→B→W→R→B"] --> B["Flash each color<br>for 200-500ms"]
B --> C["Capture face under<br>each illumination"]
C --> D["Analyze per-color<br>reflectance response"]
D --> E{"Response matches<br>expected skin<br>reflectance model?"}
E -->|Yes| F["✅ Live"]
E -->|No| G["❌ Spoof"]
style F fill:#27ae60,color:#fff
style G fill:#e74c3c,color:#fff
Why it works:
- Subsurface scattering: Real skin absorbs and scatters light differently at different wavelengths. Red light penetrates deeper into skin than blue light, so the reflectance measured under each flashed color follows a wavelength-dependent pattern.
- Screen attacks fail: The face shown on an attacker's screen was recorded under fixed lighting, so it does not change color in step with the flashed sequence the way an illuminated real face does.
- Print attacks fail: Paper has markedly different spectral reflectance from skin across the visible spectrum.
- 3D mask attacks: Silicone and latex have different spectral absorption characteristics from skin, though the difference is smaller, which makes this the hardest case.
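As a rough illustration of the screen-attack argument, the sketch below checks whether the mean face color shifts toward each flashed primary relative to an ambient baseline. It is a heavily simplified stand-in for a skin reflectance model: the function names, the restriction to R/G/B flashes, and the `min_delta` value are all assumptions.

```python
from typing import List, Tuple

RGB = Tuple[float, float, float]
CHANNEL_INDEX = {"R": 0, "G": 1, "B": 2}


def responds_to_flash(
    baseline: RGB, observed: RGB, flashed: str, min_delta: float = 3.0
) -> bool:
    """True if the flashed channel rose by at least min_delta over the baseline
    and rose at least as much as the other two channels."""
    delta = [o - b for o, b in zip(observed, baseline)]
    idx = CHANNEL_INDEX[flashed]
    return delta[idx] >= min_delta and delta[idx] == max(delta)


def sequence_response_ratio(baseline: RGB, observations: List[Tuple[str, RGB]]) -> float:
    """Fraction of flashes in the sequence that produced the expected color shift."""
    if not observations:
        return 0.0
    hits = sum(responds_to_flash(baseline, rgb, color) for color, rgb in observations)
    return hits / len(observations)
```

A face that shows the same mean color under every flash yields a ratio of 0.0, matching the fixed-lighting behavior described for replayed images. Distinguishing skin from silicone or paper would need the per-wavelength reflectance analysis, which this sketch does not attempt.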
Implementation considerations:
| Parameter | Value | Notes |
|---|---|---|
| Number of colors | 4-8 | More colors = higher security, longer process |
| Duration per color | 200-500ms | Must be long enough for camera exposure adjustment |
| Color randomization | Per-session random sequence | Prevents replay of pre-recorded color responses |
| Color choices | Red, Green, Blue, White, Cyan, Magenta, Yellow | Primary and secondary colors for maximum spectral diversity |
| Ambient light baseline | Captured before sequence starts | Used to normalize color response measurements |
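A per-session random sequence over the listed colors might be generated as follows. The use of Python's `secrets` module and the rule that no color repeats back to back (so every step produces a visible change) are assumptions of this sketch, and `generate_color_sequence` is a hypothetical name.

```python
import secrets
from typing import List

COLORS = ["red", "green", "blue", "white", "cyan", "magenta", "yellow"]


def generate_color_sequence(length: int) -> List[str]:
    """Pick `length` colors with a CSPRNG, never repeating a color back to back."""
    if not 4 <= length <= 8:
        raise ValueError("length must be between 4 and 8 colors")
    sequence: List[str] = []
    while len(sequence) < length:
        color = secrets.choice(COLORS)
        # Assumed rule: a repeated color would produce no visible transition.
        if not sequence or color != sequence[-1]:
            sequence.append(color)
    return sequence
```

The sequence should be generated on the server and tied to the session, so a response recorded for one sequence is useless for another.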
5. Speech-Based Challenges
The user reads a randomly generated phrase, number sequence, or one-time code displayed on screen.
How It Works:
The system performs multi-modal analysis:
| Analysis Layer | What's Checked |
|---|---|
| Lip-sync correlation | Mouth movements match the spoken audio timing and phoneme shapes |
| Voice liveness | Audio characteristics confirm a live voice (not text-to-speech or replay) |
| Content verification | The spoken words match the displayed prompt (ASR verification) |
| Temporal alignment | Audio and video are synchronized within acceptable tolerance (< 100ms) |
| Replay detection | Environmental acoustics match the visual environment; no echo/reverberation anomalies |
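Two of the layers above, content verification and temporal alignment, reduce to simple checks once upstream components have produced a transcript and onset timestamps. The sketch below assumes a hypothetical ASR step that returns the transcript as text with digits written as numerals. All function names are illustrative.

```python
import re
import secrets
from typing import List


def generate_digit_prompt(n_digits: int = 6) -> str:
    """Random digit sequence for the user to read aloud, e.g. '4 0 7 7 1 9'."""
    return " ".join(str(secrets.randbelow(10)) for _ in range(n_digits))


def normalize(text: str) -> List[str]:
    """Lowercase, strip punctuation, and split into tokens."""
    return re.sub(r"[^a-z0-9 ]", "", text.lower()).split()


def content_matches(prompt: str, transcript: str) -> bool:
    """Content verification: the ASR transcript must equal the displayed prompt."""
    return normalize(prompt) == normalize(transcript)


def av_in_sync(audio_onset_ms: float, lip_onset_ms: float, tolerance_ms: float = 100.0) -> bool:
    """Temporal alignment: audio and lip-movement onsets within the < 100 ms tolerance."""
    return abs(audio_onset_ms - lip_onset_ms) < tolerance_ms
```

If the ASR engine spells numbers out ("four seven"), the normalization step would need a word-to-digit mapping before comparison.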
Multi-Modal Strength
Speech-based challenges are particularly powerful because they require coordination of visual (lip movements), audio (speech), and cognitive (reading comprehension) channels simultaneously. This makes it extremely difficult for any single attack modality to succeed.
Limitations
- Not suitable for users with speech impairments or hearing loss, or for use in noisy environments
- Privacy concerns with voice capture in some jurisdictions
- Real-time lip-sync deepfakes (for example, Wav2Lip-based pipelines) can now defeat basic lip-sync analysis
- Adds 5-10 seconds to the verification process
Challenge Randomization Strategy
The randomization of challenges is critical — predictable challenges can be pre-recorded and replayed.
graph TD
A["Server generates<br>challenge pool"] --> B["Random selection<br>(cryptographically secure)"]
B --> C["Challenge sequence<br>generated"]
C --> D{"Challenge type<br>randomization"}
D --> E["Random direction<br>(head movement)"]
D --> F["Random expression<br>(smile/blink/brow)"]
D --> G["Random gaze target<br>position & path"]
D --> H["Random color<br>sequence"]
D --> I["Random speech<br>phrase"]
E --> J["Sequence: 2-4<br>challenges from<br>different types"]
F --> J
G --> J
H --> J
I --> J
J --> K["Bound to session<br>with nonce +<br>timestamp"]
Randomization principles:
- Challenge type randomization: Don't always use the same challenge type. Mix head movements, expressions, and gaze tracking.
- Parameter randomization: Within each type, randomize direction (left vs right), expression (smile vs blink), target position, color sequence, and speech content.
- Sequence length randomization: Vary the number of challenges (2-4) per session.
- Timing unpredictability: Vary the delay between challenges (0.5-2 seconds).
- Server-side generation: Challenges must be generated server-side and bound to the session with cryptographic nonces. Client-side generation is trivially bypassable.
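The principles above could be combined into a server-side generator along these lines. This is a sketch under stated assumptions: binding via HMAC-SHA256 over a JSON payload is one possible mechanism rather than one this document prescribes, the 120-second validity window is illustrative, and the pool entries for gaze, color, and speech are placeholders. All function names are hypothetical.

```python
import hashlib
import hmac
import json
import secrets
import time
from typing import Any, Dict, Optional

# Illustrative pool; gaze/color/speech parameters are placeholders for this sketch.
CHALLENGE_POOL = {
    "head_movement": ["left", "right", "up", "down"],
    "expression": ["smile", "blink", "raise_eyebrows"],
    "gaze": ["path_a", "path_b", "path_c"],
    "color": ["random_sequence"],
    "speech": ["random_digits"],
}

_rng = secrets.SystemRandom()  # CSPRNG-backed random.Random interface


def sign(server_key: bytes, payload: Dict[str, Any]) -> str:
    """HMAC-SHA256 over a canonical JSON encoding of the payload."""
    message = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode()
    return hmac.new(server_key, message, hashlib.sha256).hexdigest()


def generate_session(server_key: bytes) -> Dict[str, Any]:
    count = 2 + secrets.randbelow(3)                  # 2-4 challenges per session
    types = _rng.sample(list(CHALLENGE_POOL), k=count)  # distinct challenge types
    challenges = [
        {
            "type": t,
            "param": secrets.choice(CHALLENGE_POOL[t]),
            "delay_ms": 500 + secrets.randbelow(1501),  # 0.5-2 s between challenges
        }
        for t in types
    ]
    payload = {
        "nonce": secrets.token_hex(16),
        "issued_at": int(time.time()),
        "challenges": challenges,
    }
    return {"payload": payload, "tag": sign(server_key, payload)}


def verify_session(
    server_key: bytes,
    session: Dict[str, Any],
    max_age_s: int = 120,
    now: Optional[float] = None,
) -> bool:
    """Reject sessions whose payload was altered or whose timestamp is stale."""
    expected = sign(server_key, session["payload"])
    if not hmac.compare_digest(expected, session["tag"]):
        return False
    current = time.time() if now is None else now
    return 0 <= current - session["payload"]["issued_at"] <= max_age_s
```

A production server would also record each nonce and reject any second submission that reuses it. That bookkeeping is omitted here.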
Scoring & Decision Logic
graph TD
A["Challenge 1<br>Score: 0.92"] --> E["Weighted<br>Aggregation"]
B["Challenge 2<br>Score: 0.87"] --> E
C["Challenge 3<br>Score: 0.95"] --> E
D["Temporal Consistency<br>Score: 0.90"] --> E
E --> F["Active Liveness<br>Score: 0.91"]
F --> G{"Threshold<br>Check"}
G -->|"≥ 0.85"| H["✅ Active<br>Liveness Pass"]
G -->|"0.60 to < 0.85"| I["⚠️ Additional<br>Challenge"]
G -->|"< 0.60"| J["❌ Active<br>Liveness Fail"]
style H fill:#27ae60,color:#fff
style I fill:#f39c12,color:#fff
style J fill:#e74c3c,color:#fff
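The aggregation and threshold check in the diagram can be written as a weighted mean followed by a three-way decision. Equal weights are an assumption here; they happen to reproduce the 0.91 shown in the diagram. The sketch treats a score of exactly 0.85 as a pass and a score of exactly 0.60 as requiring an additional challenge.

```python
from typing import Sequence

PASS_THRESHOLD = 0.85
RETRY_THRESHOLD = 0.60


def aggregate(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted mean of the per-challenge and temporal-consistency scores."""
    if not scores or len(scores) != len(weights):
        raise ValueError("scores and weights must be non-empty and the same length")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    return sum(s * w for s, w in zip(scores, weights)) / total


def decide(score: float) -> str:
    if score >= PASS_THRESHOLD:
        return "pass"
    if score >= RETRY_THRESHOLD:
        return "additional_challenge"
    return "fail"


# Scores from the diagram: three challenges plus temporal consistency.
score = aggregate([0.92, 0.87, 0.95, 0.90], [1, 1, 1, 1])
print(round(score, 2), decide(score))  # 0.91 pass
```

Non-uniform weights would let a deployment emphasize the challenges it trusts most, at the cost of another parameter set to calibrate.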
Advantages & Disadvantages Summary
| Aspect | Rating | Details |
|---|---|---|
| Security against 2D attacks | ⭐⭐⭐⭐⭐ | Challenge randomization makes photo/screen replay extremely difficult |
| Security against 3D masks | ⭐⭐⭐⭐ | Effective for rigid masks; flexible masks may simulate some movements |
| Security against deepfakes | ⭐⭐⭐⭐ | Multi-modal challenges increase difficulty; real-time deepfakes can partially respond |
| User experience | ⭐⭐⭐ | Adds 5-15 seconds; clear instructions needed; some users find it confusing |
| Accessibility | ⭐⭐ | Challenging for users with motor/visual/speech impairments |
| Drop-off rate | ⭐⭐⭐ | Typically 10-25% drop-off depending on challenge complexity and UX quality |
| Processing speed | ⭐⭐⭐ | 5-15 seconds for full challenge sequence |
| Device requirements | ⭐⭐⭐⭐ | Standard front camera + display; no special hardware needed |
| Regulatory acceptance | ⭐⭐⭐⭐⭐ | Widely accepted; explainable security model that regulators understand |
Best Practices for Banking Deployment
Implementation Recommendations
- Always combine with passive liveness — Active challenges alone miss texture/frequency signals
- Limit to 2-3 challenges per session to minimize drop-off
- Provide clear visual guidance — Animated overlays showing expected motion
- Implement progressive difficulty — Start easy, escalate if risk signals detected
- Offer accessibility alternatives — Passive-only mode or Video KYC fallback for users who can't perform challenges
- Server-side validation is mandatory — Never trust client-side challenge verification alone
- Monitor challenge completion rates — If a specific challenge has >15% failure rate among genuine users, recalibrate or replace it
- Randomize everything — Challenge type, direction, sequence, timing
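The completion-rate monitoring recommendation can be sketched as a small aggregation over logged outcomes. The function name and the `min_attempts` guard against small samples are assumptions. The input is assumed to contain only sessions already confirmed as genuine users.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def challenges_to_recalibrate(
    outcomes: Iterable[Tuple[str, bool]],
    max_failure_rate: float = 0.15,
    min_attempts: int = 100,
) -> List[str]:
    """outcomes: (challenge_name, passed) pairs from genuine-user sessions.

    Returns the challenges whose failure rate exceeds max_failure_rate,
    ignoring challenges with too few attempts to judge.
    """
    attempts: Counter = Counter()
    failures: Counter = Counter()
    for challenge, passed in outcomes:
        attempts[challenge] += 1
        if not passed:
            failures[challenge] += 1
    return sorted(
        name
        for name, n in attempts.items()
        if n >= min_attempts and failures[name] / n > max_failure_rate
    )
```

With 100 logged attempts each, a challenge failing 10% of the time is left alone while one failing 20% of the time is flagged for recalibration or replacement.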