Ultra-low latency real-time human perception using MediaPipe on macOS (Apple Silicon)
A pure vision demonstration that detects and analyzes human presence in real-time with comprehensive metrics:
| Module | Metrics Tracked |
|---|---|
| Hands | Gesture recognition (peace, thumbs_up, fist, pointing, etc.), finger count, finger states, hand openness, wrist angles, palm direction |
| Face | Emotion detection (happy, surprised, angry, sad, neutral), smile/jaw/brow scores, eye blink states, 20+ blendshape metrics |
| Body | Posture score, shoulder angles, torso lean, elbow angles, arm positions, arm openness, hands raised detection |
| Attention | Focus state (focused/distracted/absent), gaze direction, head pitch/yaw/roll angles |
All processing runs locally with sub-100ms latency on Apple Silicon.
Run with `--debug` to see the real-time visualization:

```bash
python -m jarvis_perception.main --debug
```

The debug HUD overlays the detected metrics on the live camera feed.
```bash
# Clone the repository
git clone https://github.com/quyangminddock/jarvis-perception.git
cd jarvis-perception

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Download the MediaPipe models (required)
mkdir -p models
# Download from: https://developers.google.com/mediapipe/solutions/vision/hand_landmarker
# - hand_landmarker.task
# - face_landmarker.task
# - pose_landmarker.task
```
```bash
# Run with debug visualization (recommended)
python -m jarvis_perception.main --debug

# Run without visualization
python -m jarvis_perception.main

# Specify a camera
python -m jarvis_perception.main --debug --camera 1
```
| Option | Description |
|---|---|
| `--debug`, `-d` | Enable the debug visualization window |
| `--camera`, `-c` | Camera device ID (default: 0) |
| `--width` | Frame width (default: 640) |
| `--height` | Frame height (default: 480) |
| `--verbose`, `-v` | Enable verbose logging |
```
┌─────────────────┐
│     Camera      │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────┐
│    Non-Blocking Ring Buffer     │
│       (Dedicated Thread)        │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│       Vision Thread Pool        │
│  ┌─────────┐   ┌─────────┐      │
│  │  Hands  │   │  Face   │      │
│  └─────────┘   └─────────┘      │
│  ┌─────────┐   ┌─────────┐      │
│  │  Pose   │   │Head Pose│      │
│  └─────────┘   └─────────┘      │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│    Temporal Stability Layer     │
│   (Sliding Window Validation)   │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│         Async Event Bus         │
│     (State Machine Fusion)      │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│     Your Application / HUD      │
└─────────────────────────────────┘
```
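The capture stage above can be sketched as a latest-frame-wins buffer: the camera thread always overwrites the oldest slot, so the vision pool reads the newest frame without ever blocking on I/O. A minimal illustration (the class and method names here are illustrative, not the project's actual API):

```python
import threading
from collections import deque

class FrameRingBuffer:
    """Latest-frame-wins buffer: writers never block, readers get the newest frame."""

    def __init__(self, capacity: int = 2):
        # deque with maxlen silently drops the oldest frame when full
        self._frames = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def push(self, frame) -> None:
        with self._lock:
            self._frames.append(frame)

    def latest(self):
        with self._lock:
            return self._frames[-1] if self._frames else None

buffer = FrameRingBuffer(capacity=2)
for frame_id in range(5):   # simulate a camera thread pushing frames
    buffer.push(frame_id)
print(buffer.latest())      # → 4 (only the newest frames are kept)
```

Keeping the capacity tiny (2–3 frames) is what bounds latency: a slow consumer drops stale frames instead of processing a growing backlog.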
| Metric | Type | Description |
|---|---|---|
| `gesture_names` | `List[str]` | Recognized gesture per hand (fist, open_palm, peace, thumbs_up, pointing, etc.) |
| `finger_counts` | `List[int]` | Extended finger count per hand |
| `finger_states` | `List[List[bool]]` | Per-finger extended state `[thumb, index, middle, ring, pinky]` |
| `hand_openness` | `List[float]` | Hand openness, 0–1, per hand |
| `wrist_angles` | `List[float]` | Wrist rotation in degrees |
| `palm_directions` | `List[str]` | Palm facing direction |
| `handedness` | `List[str]` | Left/Right per hand |
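To illustrate how `finger_states` can drive `gesture_names`, a rule-based lookup like the following is one common approach (a simplified sketch; the project's actual gesture logic may differ):

```python
# finger_states order: [thumb, index, middle, ring, pinky]
GESTURE_RULES = {
    (False, False, False, False, False): "fist",
    (True,  True,  True,  True,  True):  "open_palm",
    (False, True,  True,  False, False): "peace",
    (True,  False, False, False, False): "thumbs_up",
    (False, True,  False, False, False): "pointing",
}

def classify_gesture(finger_states: list) -> str:
    """Map per-finger extended states to a gesture name."""
    return GESTURE_RULES.get(tuple(finger_states), "unknown")

print(classify_gesture([False, True, True, False, False]))  # → peace
```

A table lookup like this is fast and easy to extend, but brittle against noisy landmarks, which is why the temporal stability layer validates results over a sliding window before emitting them.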
| Metric | Type | Description |
|---|---|---|
| `emotion` | `str` | Detected emotion: happy, surprised, angry, sad, neutral |
| `smile_score` | `float` | Smile intensity, 0–1 |
| `jaw_open` | `float` | Mouth openness, 0–1 |
| `eyebrow_score` | `float` | Raised eyebrows (surprise indicator) |
| `blink_left` / `blink_right` | `bool` | Eye blink states |
| `eye_wide_left` / `eye_wide_right` | `float` | Eye wideness (surprise) |
| `brow_down_left` / `brow_down_right` | `float` | Frown indicators |
| `cheek_puff` | `float` | Cheek puff score |
| `lip_pucker` | `float` | Lip pucker score |
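As a sketch of how blendshape scores can be fused into the single `emotion` label, a threshold cascade like the one below is a common baseline. The thresholds here are illustrative assumptions, not the project's tuned values:

```python
def classify_emotion(smile_score: float, jaw_open: float,
                     eyebrow_score: float, brow_down: float) -> str:
    """Pick a coarse emotion label from a handful of blendshape scores.

    Thresholds are illustrative; tune them against real blendshape output.
    """
    if smile_score > 0.5:
        return "happy"
    if jaw_open > 0.4 and eyebrow_score > 0.4:
        return "surprised"  # open mouth plus raised brows
    if brow_down > 0.5:
        return "angry"      # strong frown
    if smile_score < 0.1 and eyebrow_score < 0.1 and brow_down > 0.2:
        return "sad"
    return "neutral"

print(classify_emotion(smile_score=0.8, jaw_open=0.1,
                       eyebrow_score=0.0, brow_down=0.0))  # → happy
```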
| Metric | Type | Description |
|---|---|---|
| `posture_score` | `float` | Overall posture quality, 0–1 |
| `shoulder_angle` | `float` | Shoulder tilt in degrees |
| `torso_lean` | `float` | Body lean left/right in degrees |
| `left_elbow_angle` / `right_elbow_angle` | `float` | Elbow bend angles |
| `arm_openness` | `float` | How spread the arms are, 0–1 |
| `hands_above_shoulders` | `List[str]` | Which hands are raised |
| `body_center` | `tuple` | Torso center (x, y) |
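A metric like `shoulder_angle` falls out of simple landmark geometry; for example, the tilt of the line between the two shoulder landmarks (function name and sign convention are illustrative, not the project's exact code):

```python
import math

def shoulder_angle(left: tuple, right: tuple) -> float:
    """Tilt of the shoulder line relative to horizontal, in degrees.

    Landmarks are (x, y) in normalized image coordinates; 0.0 means level
    shoulders. With image y growing downward, a lower right shoulder gives
    a positive angle.
    """
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))

print(shoulder_angle((0.3, 0.5), (0.7, 0.5)))  # → 0.0 (level shoulders)
```

The other body metrics follow the same pattern: elbow angles from three joint landmarks, torso lean from the hip–shoulder line, and so on.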
| Metric | Type | Description |
|---|---|---|
| `pitch` | `float` | Up/down rotation in degrees (nodding) |
| `yaw` | `float` | Left/right rotation in degrees (shaking) |
| `roll` | `float` | Head tilt in degrees |
| `looking_at_screen` | `bool` | Whether the user is focused on the screen |
| `attention_direction` | `str` | center, left, right, up, or down |
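`attention_direction` can be derived from yaw and pitch with the thresholds exposed in `config.py` (`vision.yaw_threshold`, `vision.pitch_threshold`). A sketch, with the sign conventions assumed rather than taken from the project:

```python
def attention_direction(yaw: float, pitch: float,
                        yaw_threshold: float = 15.0,
                        pitch_threshold: float = 35.0) -> str:
    """Map head yaw/pitch (degrees) to a coarse attention direction.

    Assumes negative yaw means the head is turned left and positive
    pitch means looking up; defaults mirror the config values.
    """
    if yaw < -yaw_threshold:
        return "left"
    if yaw > yaw_threshold:
        return "right"
    if pitch > pitch_threshold:
        return "up"
    if pitch < -pitch_threshold:
        return "down"
    return "center"

print(attention_direction(yaw=3.0, pitch=-10.0))  # → center
```

The wide default pitch threshold matters in practice: a laptop camera sits below eye level, so a user reading the screen already shows a noticeable downward pitch.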
```
jarvis-perception/
├── jarvis_perception/
│   ├── main.py                  # Entry point
│   ├── config.py                # Configuration
│   ├── vision/                  # Detectors
│   │   ├── hands_detector.py
│   │   ├── face_detector.py
│   │   ├── pose_detector.py
│   │   └── head_pose_estimator.py
│   ├── fusion/                  # State machines
│   │   ├── gesture_state.py
│   │   └── attention_state.py
│   ├── core/                    # Event system
│   │   └── event_bus.py
│   ├── capture/                 # Camera capture
│   │   └── camera_capture.py
│   └── debug/                   # Visualization
│       └── visualizer.py
├── models/                      # MediaPipe model files
├── requirements.txt
└── README.md
```
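`core/event_bus.py` suggests a publish/subscribe design for wiring detector output to consumers. A minimal async event bus along those lines might look like this (all names here are hypothetical, not the project's API):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Minimal async publish/subscribe hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, payload) -> None:
        # fan the event out to every handler registered for the topic
        await asyncio.gather(*(h(payload) for h in self._subscribers[topic]))

async def main():
    bus = EventBus()
    received = []

    async def on_gesture(payload):
        received.append(payload)

    bus.subscribe("gesture", on_gesture)
    await bus.publish("gesture", {"name": "peace", "hand": "Right"})
    return received

print(asyncio.run(main()))  # → [{'name': 'peace', 'hand': 'Right'}]
```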
Edit `config.py` to customize:

```python
# Vision detection confidence
vision.hand_detection_confidence = 0.7
vision.face_detection_confidence = 0.7

# Head pose thresholds (degrees)
vision.yaw_threshold = 15.0    # Left/right sensitivity
vision.pitch_threshold = 35.0  # Up/down sensitivity (relaxed for laptops)

# Temporal stability
stability.finger_window_size = 5
stability.attention_debounce_ms = 300
```
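The `stability.finger_window_size` setting hints at how the temporal stability layer works: a raw per-frame result is only emitted once it wins a majority vote over the last N frames. A sketch of that idea (class and method names are illustrative):

```python
from collections import Counter, deque

class SlidingWindowStabilizer:
    """Emit a value only when it dominates the recent raw observations."""

    def __init__(self, window_size: int = 5):
        self._window = deque(maxlen=window_size)

    def update(self, value):
        """Feed one raw observation; return the stable value or None."""
        self._window.append(value)
        candidate, count = Counter(self._window).most_common(1)[0]
        # require a strict majority of the window before accepting the value
        return candidate if count > len(self._window) // 2 else None

stable = SlidingWindowStabilizer(window_size=5)
for raw in [2, 2, 3, 2, 2]:   # one noisy finger count in a run of 2s
    result = stable.update(raw)
print(result)  # → 2 (the single noisy frame is voted out)
```

`attention_debounce_ms` applies the same idea in the time domain: an attention state change is only reported after it has persisted for the debounce interval.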
MIT License — feel free to use it in your projects.
Contributions are welcome! Please feel free to submit a pull request.
Built with ❤️ for real-time human perception research