
🛡️ Jarvis WorldSense

Jarvis WorldSense

Iron Man Style AR Assistant - Powered by Multimodal AI

Chinese Documentation (中文文档) | English (Active Protocol)


✨ Mission Briefing

Jarvis WorldSense transforms your computer into a fully interactive AI assistant with an Iron Man Heads-Up Display (HUD). It sees what you see, hears what you say, and responds instantly.

🦾 Key Capabilities

  • 👁️ AR Vision (HUD): Real-time Iron Man interface with eye-tracking targeting systems.
  • 🧠 Kinetic Intelligence: Powered by TinyLLaVA-3.1B, optimized for consumer hardware (runs on 8GB of RAM!).
  • 🗣️ Natural Voice: "Always-On" continuous voice conversation (English Only Protocol).
  • 🛡️ Pure Vision: AI sees the raw world, while you enjoy the AR overlay (Vision Separation Technology).
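
The "Vision Separation" idea above can be sketched in a few lines: the model always receives the untouched camera frame, while the HUD is painted on a separate copy. This is a minimal illustrative sketch (the function name and the cyan targeting box are hypothetical, not the project's actual rendering code):

```python
import numpy as np

def split_frames(raw_frame: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Return (model_frame, display_frame): the AI sees the raw pixels,
    while the HUD overlay is drawn only on a separate display copy."""
    model_frame = raw_frame            # untouched frame goes to the model
    display_frame = raw_frame.copy()   # overlay is painted on a copy
    # Hypothetical overlay: a cyan targeting box in the frame centre.
    h, w = display_frame.shape[:2]
    display_frame[h // 4, w // 4:3 * w // 4] = (0, 255, 255)
    display_frame[3 * h // 4, w // 4:3 * w // 4] = (0, 255, 255)
    display_frame[h // 4:3 * h // 4, w // 4] = (0, 255, 255)
    display_frame[h // 4:3 * h // 4, 3 * w // 4] = (0, 255, 255)
    return model_frame, display_frame

frame = np.zeros((480, 640, 3), dtype=np.uint8)
model_view, hud_view = split_frames(frame)
assert np.array_equal(model_view, frame)    # AI input is unmodified
assert not np.array_equal(hud_view, frame)  # display copy carries the HUD
```

Keeping the two buffers separate is what lets the model describe the real scene instead of its own UI.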

🚀 Quick Start

Prerequisites

  • Python 3.10+
  • macOS (M1/M2/M3) or NVIDIA GPU
  • 8GB+ RAM (Thanks to TinyLLaVA)
  • Webcam & Microphone

Installation

  1. Clone the repository

```bash
git clone https://github.com/quyangminddock/LLaVA-WorldSense.git
cd LLaVA-WorldSense
```

  2. Create a virtual environment

```bash
conda create -n jarvis python=3.10 -y
conda activate jarvis
```

  3. Install dependencies

```bash
pip install -r requirements.txt
```

Note: If you want to use the legacy LLaVA-1.5 7B/13B models, you will need to install the original LLaVA package separately. For the default TinyLLaVA experience, this is NOT required.

  4. Install audio drivers (macOS)

```bash
brew install portaudio
pip install pyaudio
```

🎮 Launch Jarvis

Recommended Mode (TinyLLaVA + Web UI):

```bash
python main.py --llava-model tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B --web
```

Access the Interface: Open http://localhost:8080 in your Chrome/Safari browser.

Note: The first run will download the TinyLLaVA model (~6GB). Please be patient.

🕹️ Controls

| Action | Control |
| --- | --- |
| Toggle HUD | Click "Toggle Camera" to engage AR systems. |
| Voice Command | Click "🎙️" once. Continuous mode stays active. |
| Snap & Analyze | Say "What do you see?" for a quick scan. |
| Deep Scan | Say "Tell me details" for a full analysis. |
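
The voice controls above boil down to routing a recognized phrase to an action. As a minimal sketch (the function and action names here are hypothetical, not the project's actual dispatcher), simple keyword matching is enough:

```python
def route_command(transcript: str) -> str:
    """Map a recognized voice phrase to a Jarvis action (hypothetical names)."""
    text = transcript.lower()
    if "detail" in text:            # "Tell me details" -> full analysis
        return "deep_scan"
    if "what do you see" in text:   # quick scene description
        return "snap_and_analyze"
    return "idle"                   # anything else: keep listening

assert route_command("What do you see?") == "snap_and_analyze"
assert route_command("Tell me details") == "deep_scan"
assert route_command("Nice weather today") == "idle"
```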

🧠 The Brain: TinyLLaVA-3.1B

Jarvis runs on TinyLLaVA-Phi-2-SigLIP-3.1B, a state-of-the-art small multimodal model that punches above its weight.

Why this model?

| Component | Tech Stack | Benefit |
| --- | --- | --- |
| Vision Encoder | SigLIP-384 | Superior to CLIP; better at fine-grained details and text in images. |
| Language Core | Microsoft Phi-2 | A 2.7B-parameter reasoning powerhouse; on par with much larger models at math and logic. |
| Connector | MLP Projection | Efficiently translates visual features into language tokens. |
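
The connector row above is the whole "glue" of the architecture: an MLP that maps each vision-encoder patch feature into the language model's embedding space. Here is a numpy sketch of that idea; the dimensions (1152-d SigLIP features, 2560-d Phi-2 embeddings, a 24x24 patch grid) and the ReLU are illustrative assumptions, not the model's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions for illustration only.
VIS_DIM, HID_DIM, LM_DIM = 1152, 2560, 2560

W1 = rng.standard_normal((VIS_DIM, HID_DIM)) * 0.02
b1 = np.zeros(HID_DIM)
W2 = rng.standard_normal((HID_DIM, LM_DIM)) * 0.02
b2 = np.zeros(LM_DIM)

def project(patch_features: np.ndarray) -> np.ndarray:
    """Map [num_patches, VIS_DIM] vision features to LM-token embeddings."""
    hidden = np.maximum(patch_features @ W1 + b1, 0.0)  # GELU in practice; ReLU here
    return hidden @ W2 + b2

patches = rng.standard_normal((576, VIS_DIM))  # e.g. a 24x24 patch grid
tokens = project(patches)
assert tokens.shape == (576, LM_DIM)  # one pseudo-token per image patch
```

Each projected row is then treated like an ordinary token embedding, so the language core can attend to the image exactly as it attends to text.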

Performance Stats

  • Memory Footprint: ~4GB VRAM (FP16) / ~6GB RAM (MPS/CPU).
  • Inference Speed: Real-time conversational latency on M1/M2/M3 chips.
  • Capabilities: Strong at object recognition, OCR (reading text), and spatial reasoning.

🔧 Troubleshooting

  • Audio Error: Ensure portaudio is installed via Homebrew.
  • AR Misalignment: The HUD uses a mirror effect. Ensure you are facing the camera directly.
  • Model Load Fail: Check your internet connection for HuggingFace downloads.
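
On the AR misalignment point: the "mirror effect" just means the preview is flipped horizontally, like a bathroom mirror. A one-function sketch of that flip (illustrative, not the project's actual rendering path):

```python
import numpy as np

def mirror(frame: np.ndarray) -> np.ndarray:
    """Horizontally flip a camera frame so the HUD behaves like a mirror."""
    return frame[:, ::-1].copy()

f = np.arange(12).reshape(2, 6)
m = mirror(f)
assert m[0, 0] == f[0, -1]           # left edge becomes right edge
assert np.array_equal(mirror(m), f)  # flipping twice restores the original
```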

🤝 Contributing

We welcome Stark Industries engineers! Please read CONTRIBUTING.md.

📄 License

MIT License. Built for the future of AI interaction.


"Sometimes you gotta run before you can walk."