Ultra-low latency real-time human perception using MediaPipe on macOS (Apple Silicon)
A pure vision demonstration that detects and analyzes human presence in real-time with comprehensive metrics:
| Module | Metrics Tracked |
|---|---|
| Hands | Gesture recognition (peace, thumbs_up, fist, pointing, etc.), finger count, finger states, hand openness, wrist angles, palm direction |
| Face | Emotion detection (happy, surprised, angry, sad, neutral), smile/jaw/brow scores, eye blink states, 20+ blendshape metrics |
| Body | Posture score, shoulder angles, torso lean, elbow angles, arm positions, arm openness, hands raised detection |
| Attention | Focus state (focused/distracted/absent), gaze direction, head pitch/yaw/roll angles |
All processing runs locally with sub-100ms latency on Apple Silicon.
Run with `--debug` to see the real-time visualization:

```bash
python -m jarvis_perception.main --debug
```

The debug HUD overlays the detected metrics on the live camera feed.
```bash
# Clone the repository
git clone https://github.com/quyangminddock/jarvis-perception.git
cd jarvis-perception

# Create a virtual environment
python3 -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Download the MediaPipe models (required)
mkdir -p models
# Download from: https://developers.google.com/mediapipe/solutions/vision/hand_landmarker
# - hand_landmarker.task
# - face_landmarker.task
# - pose_landmarker.task
```
```bash
# Run with debug visualization (recommended)
python -m jarvis_perception.main --debug

# Run without visualization
python -m jarvis_perception.main

# Specify a camera
python -m jarvis_perception.main --debug --camera 1
```
| Option | Description |
|---|---|
| `--debug`, `-d` | Enable the debug visualization window |
| `--camera`, `-c` | Camera device ID (default: 0) |
| `--width` | Frame width (default: 640) |
| `--height` | Frame height (default: 480) |
| `--verbose`, `-v` | Enable verbose logging |
```
┌─────────────────┐
│     Camera      │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────┐
│    Non-Blocking Ring Buffer     │
│       (Dedicated Thread)        │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│       Vision Thread Pool        │
│  ┌─────────┐   ┌─────────┐      │
│  │  Hands  │   │  Face   │      │
│  └─────────┘   └─────────┘      │
│  ┌─────────┐   ┌─────────┐      │
│  │  Pose   │   │Head Pose│      │
│  └─────────┘   └─────────┘      │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│    Temporal Stability Layer     │
│   (Sliding Window Validation)   │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│         Async Event Bus         │
│     (State Machine Fusion)      │
└────────┬────────────────────────┘
         │
         ▼
┌─────────────────────────────────┐
│     Your Application / HUD      │
└─────────────────────────────────┘
```
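The capture stage above can be sketched as a latest-frame-wins buffer: the camera thread always overwrites the oldest slot, so the vision pool reads the newest frame without ever blocking on I/O. A minimal illustration (the class and method names here are illustrative, not the project's actual API):

```python
import threading
from collections import deque

class FrameRingBuffer:
    """Latest-frame-wins buffer: writers never block, readers get the newest frame."""

    def __init__(self, capacity: int = 2):
        # deque with maxlen silently drops the oldest frame when full
        self._frames = deque(maxlen=capacity)
        self._lock = threading.Lock()

    def push(self, frame) -> None:
        with self._lock:
            self._frames.append(frame)

    def latest(self):
        with self._lock:
            return self._frames[-1] if self._frames else None

buffer = FrameRingBuffer(capacity=2)
for frame_id in range(5):   # simulate a camera thread pushing frames
    buffer.push(frame_id)
print(buffer.latest())      # → 4 (only the newest frames are kept)
```

Keeping the capacity tiny (2–3 frames) is what bounds latency: a slow consumer drops stale frames instead of processing a growing backlog.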
| Metric | Type | Description |
|---|---|---|
| `gesture_names` | `List[str]` | Recognized gesture per hand (fist, open_palm, peace, thumbs_up, pointing, etc.) |
| `finger_counts` | `List[int]` | Extended finger count per hand |
| `finger_states` | `List[List[bool]]` | Per-finger extended state `[thumb, index, middle, ring, pinky]` |
| `hand_openness` | `List[float]` | Hand openness, 0–1, per hand |
| `wrist_angles` | `List[float]` | Wrist rotation in degrees |
| `palm_directions` | `List[str]` | Palm facing direction |
| `handedness` | `List[str]` | Left/Right per hand |
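To illustrate how `finger_states` can drive `gesture_names`, a rule-based lookup like the following is one common approach (a simplified sketch; the project's actual gesture logic may differ):

```python
# finger_states order: [thumb, index, middle, ring, pinky]
GESTURE_RULES = {
    (False, False, False, False, False): "fist",
    (True,  True,  True,  True,  True):  "open_palm",
    (False, True,  True,  False, False): "peace",
    (True,  False, False, False, False): "thumbs_up",
    (False, True,  False, False, False): "pointing",
}

def classify_gesture(finger_states: list) -> str:
    """Map per-finger extended states to a gesture name."""
    return GESTURE_RULES.get(tuple(finger_states), "unknown")

print(classify_gesture([False, True, True, False, False]))  # → peace
```

A table lookup like this is fast and easy to extend, but brittle against noisy landmarks, which is why the temporal stability layer validates results over a sliding window before emitting them.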
| Metric | Type | Description |
|---|---|---|
| `emotion` | `str` | Detected emotion: happy, surprised, angry, sad, neutral |
| `smile_score` | `float` | Smile intensity, 0–1 |
| `jaw_open` | `float` | Mouth openness, 0–1 |
| `eyebrow_score` | `float` | Raised eyebrows (surprise indicator) |
| `blink_left` / `blink_right` | `bool` | Eye blink states |
| `eye_wide_left` / `eye_wide_right` | `float` | Eye wideness (surprise) |
| `brow_down_left` / `brow_down_right` | `float` | Frown indicators |
| `cheek_puff` | `float` | Cheek puff score |
| `lip_pucker` | `float` | Lip pucker score |
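As a sketch of how blendshape scores can be fused into the single `emotion` label, a threshold cascade like the one below is a common baseline. The thresholds here are illustrative assumptions, not the project's tuned values:

```python
def classify_emotion(smile_score: float, jaw_open: float,
                     eyebrow_score: float, brow_down: float) -> str:
    """Pick a coarse emotion label from a handful of blendshape scores.

    Thresholds are illustrative; tune them against real blendshape output.
    """
    if smile_score > 0.5:
        return "happy"
    if jaw_open > 0.4 and eyebrow_score > 0.4:
        return "surprised"  # open mouth plus raised brows
    if brow_down > 0.5:
        return "angry"      # strong frown
    if smile_score < 0.1 and eyebrow_score < 0.1 and brow_down > 0.2:
        return "sad"
    return "neutral"

print(classify_emotion(smile_score=0.8, jaw_open=0.1,
                       eyebrow_score=0.0, brow_down=0.0))  # → happy
```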
| Metric | Type | Description |
|---|---|---|
| `posture_score` | `float` | Overall posture quality, 0–1 |
| `shoulder_angle` | `float` | Shoulder tilt in degrees |
| `torso_lean` | `float` | Body lean left/right in degrees |
| `left_elbow_angle` / `right_elbow_angle` | `float` | Elbow bend angles |
| `arm_openness` | `float` | How spread the arms are, 0–1 |
| `hands_above_shoulders` | `List[str]` | Which hands are raised |
| `body_center` | `tuple` | Torso center (x, y) |
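A metric like `shoulder_angle` falls out of simple landmark geometry; for example, the tilt of the line between the two shoulder landmarks (function name and sign convention are illustrative, not the project's exact code):

```python
import math

def shoulder_angle(left: tuple, right: tuple) -> float:
    """Tilt of the shoulder line relative to horizontal, in degrees.

    Landmarks are (x, y) in normalized image coordinates; 0.0 means level
    shoulders. With image y growing downward, a lower right shoulder gives
    a positive angle.
    """
    dx = right[0] - left[0]
    dy = right[1] - left[1]
    return math.degrees(math.atan2(dy, dx))

print(shoulder_angle((0.3, 0.5), (0.7, 0.5)))  # → 0.0 (level shoulders)
```

The other body metrics follow the same pattern: elbow angles from three joint landmarks, torso lean from the hip–shoulder line, and so on.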
| Metric | Type | Description |
|---|---|---|
| `pitch` | `float` | Up/down rotation in degrees (nodding) |
| `yaw` | `float` | Left/right rotation in degrees (shaking) |
| `roll` | `float` | Head tilt in degrees |
| `looking_at_screen` | `bool` | Whether the user is focused on the screen |
| `attention_direction` | `str` | center, left, right, up, or down |
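`attention_direction` can be derived from yaw and pitch with the thresholds exposed in `config.py` (`vision.yaw_threshold`, `vision.pitch_threshold`). A sketch, with the sign conventions assumed rather than taken from the project:

```python
def attention_direction(yaw: float, pitch: float,
                        yaw_threshold: float = 15.0,
                        pitch_threshold: float = 35.0) -> str:
    """Map head yaw/pitch (degrees) to a coarse attention direction.

    Assumes negative yaw means the head is turned left and positive
    pitch means looking up; defaults mirror the config values.
    """
    if yaw < -yaw_threshold:
        return "left"
    if yaw > yaw_threshold:
        return "right"
    if pitch > pitch_threshold:
        return "up"
    if pitch < -pitch_threshold:
        return "down"
    return "center"

print(attention_direction(yaw=3.0, pitch=-10.0))  # → center
```

The wide default pitch threshold matters in practice: a laptop camera sits below eye level, so a user reading the screen already shows a noticeable downward pitch.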
```
jarvis-perception/
├── jarvis_perception/
│   ├── main.py                  # Entry point
│   ├── config.py                # Configuration
│   ├── vision/                  # Detectors
│   │   ├── hands_detector.py
│   │   ├── face_detector.py
│   │   ├── pose_detector.py
│   │   └── head_pose_estimator.py
│   ├── fusion/                  # State machines
│   │   ├── gesture_state.py
│   │   └── attention_state.py
│   ├── core/                    # Event system
│   │   └── event_bus.py
│   ├── capture/                 # Camera capture
│   │   └── camera_capture.py
│   └── debug/                   # Visualization
│       └── visualizer.py
├── models/                      # MediaPipe model files
├── requirements.txt
└── README.md
```
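`core/event_bus.py` suggests a publish/subscribe design for wiring detector output to consumers. A minimal async event bus along those lines might look like this (all names here are hypothetical, not the project's API):

```python
import asyncio
from collections import defaultdict

class EventBus:
    """Minimal async publish/subscribe hub."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, handler) -> None:
        self._subscribers[topic].append(handler)

    async def publish(self, topic: str, payload) -> None:
        # fan the event out to every handler registered for the topic
        await asyncio.gather(*(h(payload) for h in self._subscribers[topic]))

async def main():
    bus = EventBus()
    received = []

    async def on_gesture(payload):
        received.append(payload)

    bus.subscribe("gesture", on_gesture)
    await bus.publish("gesture", {"name": "peace", "hand": "Right"})
    return received

print(asyncio.run(main()))  # → [{'name': 'peace', 'hand': 'Right'}]
```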
Edit `config.py` to customize:

```python
# Vision detection confidence
vision.hand_detection_confidence = 0.7
vision.face_detection_confidence = 0.7

# Head pose thresholds (degrees)
vision.yaw_threshold = 15.0    # Left/right sensitivity
vision.pitch_threshold = 35.0  # Up/down sensitivity (relaxed for laptops)

# Temporal stability
stability.finger_window_size = 5
stability.attention_debounce_ms = 300
```
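The `stability.finger_window_size` setting hints at how the temporal stability layer works: a raw per-frame result is only emitted once it wins a majority vote over the last N frames. A sketch of that idea (class and method names are illustrative):

```python
from collections import Counter, deque

class SlidingWindowStabilizer:
    """Emit a value only when it dominates the recent raw observations."""

    def __init__(self, window_size: int = 5):
        self._window = deque(maxlen=window_size)

    def update(self, value):
        """Feed one raw observation; return the stable value or None."""
        self._window.append(value)
        candidate, count = Counter(self._window).most_common(1)[0]
        # require a strict majority of the window before accepting the value
        return candidate if count > len(self._window) // 2 else None

stable = SlidingWindowStabilizer(window_size=5)
for raw in [2, 2, 3, 2, 2]:   # one noisy finger count in a run of 2s
    result = stable.update(raw)
print(result)  # → 2 (the single noisy frame is voted out)
```

`attention_debounce_ms` applies the same idea in the time domain: an attention state change is only reported after it has persisted for the debounce interval.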
MIT License — feel free to use it in your projects.
Contributions are welcome! Please feel free to submit a pull request.
Built with ❤️ for real-time human perception research