🧠 Human Perception Demo

Ultra-low latency real-time human perception using MediaPipe on macOS (Apple Silicon)

License: MIT · Python 3.10+ · macOS

🎯 What This Does

A vision-only demonstration that detects and analyzes human presence in real time, reporting metrics from four modules:

| Module | Metrics Tracked |
|---|---|
| Hands | Gesture recognition (peace, thumbs_up, fist, pointing, etc.), finger count, finger states, hand openness, wrist angles, palm direction |
| Face | Emotion detection (happy, surprised, angry, sad, neutral), smile/jaw/brow scores, eye blink states, 20+ blendshape metrics |
| Body | Posture score, shoulder angles, torso lean, elbow angles, arm positions, arm openness, hands raised detection |
| Attention | Focus state (focused/distracted/absent), gaze direction, head pitch/yaw/roll angles |

All processing runs locally with sub-100ms latency on Apple Silicon.

✨ Features

  • 🚀 Ultra-Low Latency: End-to-end perception < 100ms
  • 🔒 100% Local: No cloud, no data leaves your machine
  • 📊 Comprehensive Metrics: 50+ tracked parameters
  • 🎨 Real-time HUD: Live visualization of all perception data
  • Event-Driven: Async architecture, no polling
  • 🔄 Temporal Stability: All detections require time-stable confirmation

Demo

Run with --debug to see the real-time visualization:

python -m jarvis_perception.main --debug

The HUD displays:

  • HANDS: Gesture name, finger count, hand openness per hand
  • FACE: Detected emotion, expression scores, eye states
  • ATTENTION: Focus state, head orientation angles
  • BODY: Posture quality, shoulder/torso metrics, arm positions
  • Performance: Live latency in milliseconds

🛠️ Requirements

  • macOS (Apple Silicon recommended, Intel supported)
  • Python 3.10+
  • Webcam

📦 Installation

# Clone the repository
git clone https://github.com/quyangminddock/jarvis-perception.git
cd jarvis-perception

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Download MediaPipe models (required)
mkdir -p models
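# Download the following files into models/. The hand model is linked from:
# https://developers.google.com/mediapipe/solutions/vision/hand_landmarker
# (the face and pose models are linked from the corresponding face_landmarker and pose_landmarker pages)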
# - hand_landmarker.task
# - face_landmarker.task  
# - pose_landmarker.task

🚀 Quick Start

# Run with debug visualization (recommended)
python -m jarvis_perception.main --debug

# Run without visualization
python -m jarvis_perception.main

# Specify camera
python -m jarvis_perception.main --debug --camera 1

Command Line Options

| Option | Description |
|---|---|
| --debug, -d | Enable debug visualization window |
| --camera, -c | Camera device ID (default: 0) |
| --width | Frame width (default: 640) |
| --height | Frame height (default: 480) |
| --verbose, -v | Enable verbose logging |
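
For reference, here is a minimal sketch of an argparse parser that accepts the options in the table above. It is an illustration only and may not match how main.py actually parses its arguments; the function name build_parser is hypothetical.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser mirroring the documented options (hypothetical helper)."""
    parser = argparse.ArgumentParser(prog="jarvis_perception")
    parser.add_argument("--debug", "-d", action="store_true",
                        help="Enable debug visualization window")
    parser.add_argument("--camera", "-c", type=int, default=0,
                        help="Camera device ID")
    parser.add_argument("--width", type=int, default=640, help="Frame width")
    parser.add_argument("--height", type=int, default=480, help="Frame height")
    parser.add_argument("--verbose", "-v", action="store_true",
                        help="Enable verbose logging")
    return parser


# Parse the same flags as the "Specify camera" example in Quick Start.
args = build_parser().parse_args(["--debug", "--camera", "1"])
print(args.debug, args.camera, args.width, args.height)
```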

🏗️ Architecture

┌─────────────────┐
│     Camera      │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────┐
│   Non-Blocking Ring Buffer      │
│      (Dedicated Thread)         │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│      Vision Thread Pool         │
│  ┌─────────┐ ┌─────────┐        │
│  │  Hands  │ │  Face   │        │
│  └─────────┘ └─────────┘        │
│  ┌─────────┐ ┌─────────┐        │
│  │  Pose   │ │Head Pose│        │
│  └─────────┘ └─────────┘        │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│   Temporal Stability Layer      │
│  (Sliding Window Validation)    │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│     Async Event Bus             │
│   (State Machine Fusion)        │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│   Your Application / HUD        │
└─────────────────────────────────┘
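
The first stage of the diagram can be sketched as a small fixed-size buffer that a capture thread writes into without waiting for consumers. This is an assumption about the general technique, not the code in camera_capture.py; the class name FrameRingBuffer and its methods are hypothetical, and integers stand in for camera frames.

```python
import threading
from collections import deque
from typing import Any, Optional


class FrameRingBuffer:
    """Fixed-size buffer: the writer never waits for a reader, old frames are overwritten."""

    def __init__(self, capacity: int = 2) -> None:
        self._frames: deque = deque(maxlen=capacity)
        self._lock = threading.Lock()
        self.dropped = 0  # frames overwritten before anyone read them

    def push(self, frame: Any) -> None:
        with self._lock:
            if len(self._frames) == self._frames.maxlen:
                self.dropped += 1  # the oldest frame is about to be discarded
            self._frames.append(frame)

    def latest(self) -> Optional[Any]:
        """Return the newest frame and discard older ones, or None if empty."""
        with self._lock:
            if not self._frames:
                return None
            frame = self._frames[-1]
            self._frames.clear()
            return frame


# A dedicated producer thread pushes ten stand-in "frames" with no consumer running.
buf = FrameRingBuffer(capacity=2)
producer = threading.Thread(target=lambda: [buf.push(i) for i in range(10)])
producer.start()
producer.join()
newest = buf.latest()
print(newest, buf.dropped)
```

Reading only the newest frame is what lets the pipeline trade completeness for latency: a slow detector skips stale frames instead of falling behind.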

📊 Tracked Metrics

Hands (HandResult)

| Metric | Type | Description |
|---|---|---|
| gesture_names | List[str] | Recognized gesture per hand (fist, open_palm, peace, thumbs_up, pointing, etc.) |
| finger_counts | List[int] | Extended finger count per hand |
| finger_states | List[List[bool]] | Per-finger extended state [thumb, index, middle, ring, pinky] |
| hand_openness | List[float] | Hand openness 0-1 per hand |
| wrist_angles | List[float] | Wrist rotation in degrees |
| palm_directions | List[str] | Palm facing direction |
| handedness | List[str] | Left/Right per hand |
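
To show how finger_states relates to finger_counts and gesture_names, here is a minimal sketch that maps one hand's finger states to a gesture by exact pattern lookup. The lookup table and the "unknown" fallback are assumptions for illustration; the actual detector may use different rules.

```python
from typing import Dict, List, Tuple

# Order follows the table above: [thumb, index, middle, ring, pinky].
# Hypothetical pattern table; the real recognizer may be more tolerant.
GESTURES: Dict[Tuple[bool, ...], str] = {
    (False, False, False, False, False): "fist",
    (True, True, True, True, True): "open_palm",
    (False, True, True, False, False): "peace",
    (True, False, False, False, False): "thumbs_up",
    (False, True, False, False, False): "pointing",
}


def finger_count(states: List[bool]) -> int:
    """Number of extended fingers on one hand."""
    return sum(states)


def classify_gesture(states: List[bool]) -> str:
    """Look up the gesture name, falling back to 'unknown' (assumed label)."""
    return GESTURES.get(tuple(states), "unknown")


peace = [False, True, True, False, False]
print(classify_gesture(peace), finger_count(peace))
```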

Face (FaceResult)

| Metric | Type | Description |
|---|---|---|
| emotion | str | Detected emotion: happy, surprised, angry, sad, neutral |
| smile_score | float | Smile intensity 0-1 |
| jaw_open | float | Mouth openness 0-1 |
| eyebrow_score | float | Raised eyebrows (surprise indicator) |
| blink_left/right | bool | Eye blink states |
| eye_wide_left/right | float | Eye wideness (surprise) |
| brow_down_left/right | float | Frown indicators |
| cheek_puff | float | Cheek puff score |
| lip_pucker | float | Lip pucker score |

Body (PoseResult)

| Metric | Type | Description |
|---|---|---|
| posture_score | float | Overall posture quality 0-1 |
| shoulder_angle | float | Shoulder tilt in degrees |
| torso_lean | float | Body lean left/right in degrees |
| left/right_elbow_angle | float | Elbow bend angles |
| arm_openness | float | How spread the arms are 0-1 |
| hands_above_shoulders | List[str] | Which hands are raised |
| body_center | tuple | Torso center (x, y) |
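
Angles such as the elbow bend and shoulder tilt can be derived from 2D landmark coordinates with basic trigonometry. The sketch below assumes (x, y) points in a shared coordinate system; the function names are hypothetical and pose_detector.py may compute these values differently.

```python
import math
from typing import Tuple

Point = Tuple[float, float]


def joint_angle(a: Point, b: Point, c: Point) -> float:
    """Angle in degrees at point b, formed by the segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1])
    v2 = (c[0] - b[0], c[1] - b[1])
    norm = math.hypot(*v1) * math.hypot(*v2)
    if norm == 0:
        return 0.0  # degenerate case: two landmarks coincide
    cos_angle = (v1[0] * v2[0] + v1[1] * v2[1]) / norm
    # Clamp to guard against floating-point drift outside [-1, 1].
    return math.degrees(math.acos(max(-1.0, min(1.0, cos_angle))))


def shoulder_tilt(left: Point, right: Point) -> float:
    """Tilt of the shoulder line relative to horizontal, in degrees."""
    return math.degrees(math.atan2(right[1] - left[1], right[0] - left[0]))


# Elbow angle with shoulder, elbow, and wrist in a straight line, then bent at a right angle.
straight = joint_angle((0.0, 0.0), (1.0, 0.0), (2.0, 0.0))
bent = joint_angle((0.0, 1.0), (0.0, 0.0), (1.0, 0.0))
print(round(straight), round(bent))
```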

Attention (HeadPoseResult)

| Metric | Type | Description |
|---|---|---|
| pitch | float | Up/down rotation (nodding) |
| yaw | float | Left/right rotation (shaking) |
| roll | float | Head tilt |
| looking_at_screen | bool | User focused on screen |
| attention_direction | str | center, left, right, up, down |
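
One plausible way to derive attention_direction and looking_at_screen from the head angles is a simple threshold test. The sketch below uses the default yaw and pitch thresholds listed in the Configuration section; the sign conventions (positive yaw is right, positive pitch is up) and the precedence of yaw over pitch are assumptions, not confirmed behavior of head_pose_estimator.py.

```python
def attention_direction(yaw: float, pitch: float,
                        yaw_threshold: float = 15.0,
                        pitch_threshold: float = 35.0) -> str:
    """Classify head orientation in degrees into one of five directions."""
    # Assumed convention: positive yaw = right, positive pitch = up.
    if abs(yaw) > yaw_threshold:
        return "right" if yaw > 0 else "left"
    if abs(pitch) > pitch_threshold:
        return "up" if pitch > 0 else "down"
    return "center"


def looking_at_screen(yaw: float, pitch: float) -> bool:
    """Treat a 'center' direction as looking at the screen."""
    return attention_direction(yaw, pitch) == "center"


print(attention_direction(20.0, 0.0), looking_at_screen(10.0, 30.0))
```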

🎯 Design Principles

  1. No Cloud Dependencies: Everything runs locally
  2. No Vision LLMs: Uses lightweight, specialized models only
  3. Temporal Stability Required: No single-frame conclusions
  4. Non-Blocking: Inference threads never block main loop
  5. Event-Driven: Push-based, not polling
  6. Frame Drops Allowed: Prioritize latency over processing every frame
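
Principle 5 can be illustrated with a tiny asyncio publish/subscribe bus: consumers register a handler once and are called when an event arrives, rather than repeatedly asking for state. This is a sketch of the pattern only; the EventBus class, the "gesture" topic, and the payload shape are hypothetical and may differ from core/event_bus.py.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Handler = Callable[[Any], Awaitable[None]]


class EventBus:
    """Minimal push-based bus: publish() awaits every handler for the topic."""

    def __init__(self) -> None:
        self._handlers: Dict[str, List[Handler]] = {}

    def subscribe(self, topic: str, handler: Handler) -> None:
        self._handlers.setdefault(topic, []).append(handler)

    async def publish(self, topic: str, payload: Any) -> None:
        handlers = self._handlers.get(topic, [])
        if handlers:
            # Run all handlers for this topic concurrently.
            await asyncio.gather(*(handler(payload) for handler in handlers))


received: List[Any] = []


async def on_gesture(payload: Any) -> None:
    received.append(payload)


async def demo() -> None:
    bus = EventBus()
    bus.subscribe("gesture", on_gesture)
    await bus.publish("gesture", {"name": "peace"})
    await bus.publish("unhandled_topic", {"ignored": True})  # no subscribers


asyncio.run(demo())
print(received)
```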

📁 Project Structure

jarvis-perception/
├── jarvis_perception/
│   ├── main.py              # Entry point
│   ├── config.py            # Configuration
│   ├── vision/              # Detectors
│   │   ├── hands_detector.py
│   │   ├── face_detector.py
│   │   ├── pose_detector.py
│   │   └── head_pose_estimator.py
│   ├── fusion/              # State machines
│   │   ├── gesture_state.py
│   │   └── attention_state.py
│   ├── core/                # Event system
│   │   └── event_bus.py
│   ├── capture/             # Camera capture
│   │   └── camera_capture.py
│   └── debug/               # Visualization
│       └── visualizer.py
├── models/                  # MediaPipe model files
├── requirements.txt
└── README.md

🔧 Configuration

Edit config.py to customize:

# Vision detection confidence
vision.hand_detection_confidence = 0.7
vision.face_detection_confidence = 0.7

# Head pose thresholds (degrees)
vision.yaw_threshold = 15.0    # Left/right sensitivity
vision.pitch_threshold = 35.0  # Up/down sensitivity (relaxed for laptops)

# Temporal stability
stability.finger_window_size = 5
stability.attention_debounce_ms = 300
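
To make the two temporal stability settings concrete, the sketch below shows one way a window size and a debounce interval could be applied: a majority vote over the last N frames, and a state change that is accepted only after it persists for the debounce time. Both classes are hypothetical illustrations under those assumptions, not the project's stability layer.

```python
from collections import Counter, deque
from typing import Deque, Hashable, Optional


class SlidingWindowVote:
    """Report a value only when it holds a strict majority of a full window."""

    def __init__(self, window_size: int = 5) -> None:
        self._window: Deque[Hashable] = deque(maxlen=window_size)

    def update(self, value: Hashable) -> Optional[Hashable]:
        self._window.append(value)
        if len(self._window) < self._window.maxlen:
            return None  # not enough frames yet
        winner, votes = Counter(self._window).most_common(1)[0]
        return winner if votes > self._window.maxlen // 2 else None


class Debouncer:
    """Accept a state change only after it has persisted for debounce_ms."""

    def __init__(self, debounce_ms: float = 300.0, initial: Optional[str] = None) -> None:
        self.state = initial
        self._candidate = initial
        self._since_ms = 0.0
        self._debounce_ms = debounce_ms

    def update(self, value: Optional[str], now_ms: float) -> Optional[str]:
        if value != self._candidate:
            # New candidate: restart the persistence timer.
            self._candidate = value
            self._since_ms = now_ms
        if value != self.state and now_ms - self._since_ms >= self._debounce_ms:
            self.state = value
        return self.state


# One noisy frame (3) does not change the reported finger count.
vote = SlidingWindowVote(window_size=5)
counts = [vote.update(v) for v in [2, 2, 3, 2, 2]]

# "distracted" is accepted only once it has lasted 300 ms.
attention = Debouncer(debounce_ms=300.0, initial="focused")
states = [attention.update("distracted", t) for t in (0.0, 200.0, 300.0)]
print(counts, states)
```

Larger windows and longer debounce times suppress more flicker at the cost of slower reaction, so these values trade stability against the latency target.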

📄 License

MIT License - feel free to use in your projects!

🤝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.

Built with ❤️ for real-time human perception research
