
🧠 Human Perception Demo

Ultra-low latency real-time human perception using MediaPipe on macOS (Apple Silicon)

License: MIT Python 3.10+ macOS

🎯 What This Does

A pure vision demonstration that detects and analyzes human presence in real-time with comprehensive metrics:

| Module | Metrics Tracked |
|--------|-----------------|
| Hands | Gesture recognition (peace, thumbs_up, fist, pointing, etc.), finger count, finger states, hand openness, wrist angles, palm direction |
| Face | Emotion detection (happy, surprised, angry, sad, neutral), smile/jaw/brow scores, eye blink states, 20+ blendshape metrics |
| Body | Posture score, shoulder angles, torso lean, elbow angles, arm positions, arm openness, hands raised detection |
| Attention | Focus state (focused/distracted/absent), gaze direction, head pitch/yaw/roll angles |

All processing runs locally with sub-100ms latency on Apple Silicon.

✨ Features

  • 🚀 Ultra-Low Latency: End-to-end perception < 100ms
  • 🔒 100% Local: No cloud, no data leaves your machine
  • 📊 Comprehensive Metrics: 50+ tracked parameters
  • 🎨 Real-time HUD: Live visualization of all perception data
  • ⚡ Event-Driven: Async architecture, no polling
  • 🔄 Temporal Stability: All detections require time-stable confirmation

Demo

Run with --debug to see the real-time visualization:

```shell
python -m jarvis_perception.main --debug
```

The HUD displays:

  • HANDS: Gesture name, finger count, hand openness per hand
  • FACE: Detected emotion, expression scores, eye states
  • ATTENTION: Focus state, head orientation angles
  • BODY: Posture quality, shoulder/torso metrics, arm positions
  • Performance: Live latency in milliseconds

🛠️ Requirements

  • macOS (Apple Silicon recommended, Intel supported)
  • Python 3.10+
  • Webcam

📦 Installation

```shell
# Clone the repository
git clone https://github.com/quyangminddock/jarvis-perception.git
cd jarvis-perception

# Create virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Download MediaPipe models (required)
mkdir -p models
# Download from: https://developers.google.com/mediapipe/solutions/vision/hand_landmarker
# - hand_landmarker.task
# - face_landmarker.task
# - pose_landmarker.task
```

🚀 Quick Start

```shell
# Run with debug visualization (recommended)
python -m jarvis_perception.main --debug

# Run without visualization
python -m jarvis_perception.main

# Specify camera
python -m jarvis_perception.main --debug --camera 1
```

Command Line Options

| Option | Description |
|--------|-------------|
| `--debug`, `-d` | Enable debug visualization window |
| `--camera`, `-c` | Camera device ID (default: 0) |
| `--width` | Frame width (default: 640) |
| `--height` | Frame height (default: 480) |
| `--verbose`, `-v` | Enable verbose logging |
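
For reference, a hypothetical `argparse` parser mirroring the options above (the real `jarvis_perception.main` may define its CLI differently):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    # Illustrative parser matching the options table; not the project's actual code
    p = argparse.ArgumentParser(prog="jarvis_perception")
    p.add_argument("--debug", "-d", action="store_true",
                   help="Enable debug visualization window")
    p.add_argument("--camera", "-c", type=int, default=0,
                   help="Camera device ID")
    p.add_argument("--width", type=int, default=640, help="Frame width")
    p.add_argument("--height", type=int, default=480, help="Frame height")
    p.add_argument("--verbose", "-v", action="store_true",
                   help="Enable verbose logging")
    return p


args = build_parser().parse_args(["--debug", "-c", "1"])
print(args.debug, args.camera, args.width)  # True 1 640
```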

🏗️ Architecture

```
        ┌─────────────────┐
        │      Camera     │
        └────────┬────────┘
                 │
                 ▼
┌─────────────────────────────────┐
│    Non-Blocking Ring Buffer     │
│       (Dedicated Thread)        │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│       Vision Thread Pool        │
│   ┌─────────┐   ┌─────────┐     │
│   │  Hands  │   │  Face   │     │
│   └─────────┘   └─────────┘     │
│   ┌─────────┐   ┌─────────┐     │
│   │  Pose   │   │Head Pose│     │
│   └─────────┘   └─────────┘     │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│    Temporal Stability Layer     │
│   (Sliding Window Validation)   │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│         Async Event Bus         │
│     (State Machine Fusion)      │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│     Your Application / HUD      │
└─────────────────────────────────┘
```
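
At the end of the pipeline, stabilized results are pushed to subscribers through the event bus. A minimal sketch of a push-based async bus (illustrative only; the project's actual `core/event_bus.py` may differ):

```python
import asyncio
from collections import defaultdict
from typing import Any, Callable


class EventBus:
    """Minimal push-based event bus: subscribers register async handlers
    per topic and are awaited on every publish. No polling involved."""

    def __init__(self) -> None:
        self._subs: dict[str, list[Callable]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable) -> None:
        self._subs[topic].append(handler)

    async def publish(self, topic: str, payload: Any) -> None:
        for handler in self._subs[topic]:
            await handler(payload)


bus = EventBus()
received: list = []


async def on_gesture(event) -> None:
    received.append(event)


bus.subscribe("gesture", on_gesture)
asyncio.run(bus.publish("gesture", {"name": "peace", "stable": True}))
print(received)  # [{'name': 'peace', 'stable': True}]
```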

📊 Tracked Metrics

Hands (HandResult)

| Metric | Type | Description |
|--------|------|-------------|
| `gesture_names` | `List[str]` | Recognized gesture per hand (fist, open_palm, peace, thumbs_up, pointing, etc.) |
| `finger_counts` | `List[int]` | Extended finger count per hand |
| `finger_states` | `List[List[bool]]` | Per-finger extended state [thumb, index, middle, ring, pinky] |
| `hand_openness` | `List[float]` | Hand openness 0-1 per hand |
| `wrist_angles` | `List[float]` | Wrist rotation in degrees |
| `palm_directions` | `List[str]` | Palm facing direction |
| `handedness` | `List[str]` | Left/Right per hand |
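
As a sketch of what a per-frame hands payload might look like, here is a dataclass following the field names in the table above (illustrative; the project's actual `HandResult` may be structured differently):

```python
from dataclasses import dataclass, field


@dataclass
class HandResult:
    """One entry per detected hand in each list; lists stay index-aligned."""

    gesture_names: list[str] = field(default_factory=list)
    finger_counts: list[int] = field(default_factory=list)
    finger_states: list[list[bool]] = field(default_factory=list)
    hand_openness: list[float] = field(default_factory=list)
    wrist_angles: list[float] = field(default_factory=list)
    palm_directions: list[str] = field(default_factory=list)
    handedness: list[str] = field(default_factory=list)


# One right hand showing a "peace" sign: index + middle fingers extended
r = HandResult(
    gesture_names=["peace"],
    finger_counts=[2],
    finger_states=[[False, True, True, False, False]],
    hand_openness=[0.6],
    handedness=["Right"],
)
print(r.gesture_names, r.finger_counts)
```

Note the invariant implied by the table: `sum(finger_states[i])` should equal `finger_counts[i]` for each hand.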

Face (FaceResult)

| Metric | Type | Description |
|--------|------|-------------|
| `emotion` | `str` | Detected emotion: happy, surprised, angry, sad, neutral |
| `smile_score` | `float` | Smile intensity 0-1 |
| `jaw_open` | `float` | Mouth openness 0-1 |
| `eyebrow_score` | `float` | Raised eyebrows (surprise indicator) |
| `blink_left/right` | `bool` | Eye blink states |
| `eye_wide_left/right` | `float` | Eye wideness (surprise) |
| `brow_down_left/right` | `float` | Frown indicators |
| `cheek_puff` | `float` | Cheek puff score |
| `lip_pucker` | `float` | Lip pucker score |
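
Emotion labels like these are typically derived from blendshape scores with simple rules. A toy illustration of the mapping (thresholds are made up for the example; the project's detector likely weighs many more blendshapes):

```python
def classify_emotion(smile: float, jaw_open: float,
                     brow_raise: float, brow_down: float) -> str:
    """Map a few blendshape scores (each 0-1) to a coarse emotion label.
    Thresholds here are illustrative, not the project's actual values."""
    if smile > 0.5:
        return "happy"
    if jaw_open > 0.5 and brow_raise > 0.4:
        return "surprised"  # open mouth + raised brows
    if brow_down > 0.5:
        return "angry"  # lowered brows read as a frown
    return "neutral"


print(classify_emotion(0.8, 0.1, 0.0, 0.0))  # happy
print(classify_emotion(0.1, 0.7, 0.6, 0.0))  # surprised
```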

Body (PoseResult)

| Metric | Type | Description |
|--------|------|-------------|
| `posture_score` | `float` | Overall posture quality 0-1 |
| `shoulder_angle` | `float` | Shoulder tilt in degrees |
| `torso_lean` | `float` | Body lean left/right in degrees |
| `left/right_elbow_angle` | `float` | Elbow bend angles |
| `arm_openness` | `float` | How spread the arms are, 0-1 |
| `hands_above_shoulders` | `List[str]` | Which hands are raised |
| `body_center` | `tuple` | Torso center (x, y) |
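
Angle metrics like `shoulder_angle` fall out of simple landmark geometry. A sketch of the computation from two shoulder landmarks (the sign convention and landmark source are assumptions for illustration):

```python
import math


def shoulder_angle(left: tuple[float, float],
                   right: tuple[float, float]) -> float:
    """Shoulder tilt in degrees from two (x, y) landmark points.
    0 means the shoulders are level; sign convention is illustrative."""
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))


print(round(shoulder_angle((0.3, 0.5), (0.7, 0.5)), 1))  # 0.0 -> level
print(round(shoulder_angle((0.0, 0.0), (1.0, 1.0)), 1))  # 45.0 -> tilted
```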

Attention (HeadPoseResult)

| Metric | Type | Description |
|--------|------|-------------|
| `pitch` | `float` | Up/down rotation (nodding) |
| `yaw` | `float` | Left/right rotation (shaking) |
| `roll` | `float` | Head tilt |
| `looking_at_screen` | `bool` | User focused on screen |
| `attention_direction` | `str` | center, left, right, up, down |
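
`attention_direction` can be derived by thresholding yaw and pitch. A sketch using the default thresholds from the Configuration section below (the sign conventions are assumptions for illustration):

```python
def attention_direction(yaw: float, pitch: float,
                        yaw_thresh: float = 15.0,
                        pitch_thresh: float = 35.0) -> str:
    """Map head yaw/pitch (degrees) to a coarse attention direction.
    Thresholds mirror the config defaults; signs are illustrative."""
    if yaw < -yaw_thresh:
        return "left"
    if yaw > yaw_thresh:
        return "right"
    if pitch > pitch_thresh:
        return "up"
    if pitch < -pitch_thresh:
        return "down"
    return "center"


print(attention_direction(20.0, 0.0))  # right
print(attention_direction(0.0, 0.0))   # center
```

The asymmetric defaults (15° for yaw vs. 35° for pitch) reflect that laptop webcams sit below eye level, so large pitch excursions are normal.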

🎯 Design Principles

  1. No Cloud Dependencies: Everything runs locally
  2. No Vision LLMs: Uses lightweight, specialized models only
  3. Temporal Stability Required: No single-frame conclusions
  4. Non-Blocking: Inference threads never block main loop
  5. Event-Driven: Push-based, not polling
  6. Frame Drops Allowed: Prioritize latency over processing every frame
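
Principles 4 and 6 combine naturally in a single-slot frame buffer: the capture thread always overwrites, consumers always read the newest frame, and stale frames are silently dropped. A sketch of that idea (illustrative, not the project's actual capture code):

```python
import threading
from collections import deque


class LatestFrameBuffer:
    """Single-slot buffer: writers overwrite, readers get only the newest
    frame. Older frames are dropped, trading completeness for latency."""

    def __init__(self) -> None:
        self._buf = deque(maxlen=1)  # only the newest frame survives
        self._lock = threading.Lock()

    def put(self, frame) -> None:
        with self._lock:
            self._buf.append(frame)  # silently evicts any unread frame

    def get_latest(self):
        with self._lock:
            return self._buf[-1] if self._buf else None


buf = LatestFrameBuffer()
for i in range(5):
    buf.put(i)  # frames 0-3 are dropped before anyone reads
print(buf.get_latest())  # 4
```

Because `put` never blocks on a slow consumer, a lagging inference thread can never back-pressure the camera loop.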

📁 Project Structure

```
jarvis-perception/
├── jarvis_perception/
│   ├── main.py                  # Entry point
│   ├── config.py                # Configuration
│   ├── vision/                  # Detectors
│   │   ├── hands_detector.py
│   │   ├── face_detector.py
│   │   ├── pose_detector.py
│   │   └── head_pose_estimator.py
│   ├── fusion/                  # State machines
│   │   ├── gesture_state.py
│   │   └── attention_state.py
│   ├── core/                    # Event system
│   │   └── event_bus.py
│   ├── capture/                 # Camera capture
│   │   └── camera_capture.py
│   └── debug/                   # Visualization
│       └── visualizer.py
├── models/                      # MediaPipe model files
├── requirements.txt
└── README.md
```

🔧 Configuration

Edit config.py to customize:

```python
# Vision detection confidence
vision.hand_detection_confidence = 0.7
vision.face_detection_confidence = 0.7

# Head pose thresholds (degrees)
vision.yaw_threshold = 15.0    # Left/right sensitivity
vision.pitch_threshold = 35.0  # Up/down sensitivity (relaxed for laptops)

# Temporal stability
stability.finger_window_size = 5
stability.attention_debounce_ms = 300
```

📄 License

MIT License - feel free to use in your projects!

🤝 Contributing

Contributions welcome! Please feel free to submit a Pull Request.

Built with ❤️ for real-time human perception research
