Iron Man Style AR Assistant - Powered by Multimodal AI
Chinese Documentation (中文文档) | English (Active Protocol)
Jarvis WorldSense transforms your computer into a fully interactive AI assistant with an Iron Man Heads-Up Display (HUD). It sees what you see, hears what you say, and responds instantly.
```shell
git clone https://github.com/quyangminddock/LLaVA-WorldSense.git
cd LLaVA-WorldSense
conda create -n jarvis python=3.10 -y
conda activate jarvis
pip install -r requirements.txt
```
Note: If you want to use the legacy LLaVA-1.5 7B/13B models, you will need to install the original LLaVA package separately. For the default TinyLLaVA experience, this is NOT required.
```shell
brew install portaudio
pip install pyaudio
```
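After installing, a quick way to confirm PyAudio can actually see a microphone is to enumerate input devices (a minimal diagnostic sketch; device counts vary by machine):

```python
def check_audio_input():
    """Return a short diagnostic string about microphone availability."""
    try:
        import pyaudio
    except ImportError:
        return "pyaudio not installed - run: pip install pyaudio"
    pa = pyaudio.PyAudio()
    try:
        # Count devices that expose at least one input channel.
        n = sum(
            1
            for i in range(pa.get_device_count())
            if pa.get_device_info_by_index(i)["maxInputChannels"] > 0
        )
        return f"{n} input device(s) found"
    finally:
        pa.terminate()

print(check_audio_input())
```

If this reports zero input devices, check macOS microphone permissions for your terminal before debugging Jarvis itself.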
Recommended Mode (TinyLLaVA + Web UI):
```shell
python main.py --llava-model tinyllava/TinyLLaVA-Phi-2-SigLIP-3.1B --web
```
Access the Interface: Open http://localhost:8080 in Chrome or Safari.
Note: The first run will download the TinyLLaVA model (~6GB). Please be patient.
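Because the first launch downloads and loads the model before the web UI comes up, a small helper that polls until localhost:8080 accepts connections can save guesswork (a sketch; the host and port are assumed from the default command above):

```python
import socket
import time

def wait_for_server(host, port, timeout=600.0):
    """Poll until a TCP server accepts connections; return False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Succeeds as soon as something is listening on host:port.
            with socket.create_connection((host, port), timeout=1.0):
                return True
        except OSError:
            time.sleep(0.5)
    return False

# Example: block until the Jarvis web UI is reachable.
if wait_for_server("localhost", 8080, timeout=600.0):
    print("Web UI is up - open http://localhost:8080")
else:
    print("Timed out waiting for the web UI")
```

This only checks TCP reachability, not that the model has finished loading, but it is enough to know when the browser tab is worth opening.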
| Action | Control |
|---|---|
| Toggle HUD | Click "Toggle Camera" to engage AR systems. |
| Voice Command | Click "🎙️" once. Continuous mode stays active. |
| Snap & Analyze | Say "What do you see?" for a quick scan. |
| Deep Scan | Say "Tell me details" for a full analysis. |
Jarvis runs on TinyLLaVA-Phi-2-SigLIP-3.1B, a state-of-the-art small multimodal model that punches above its weight.
| Component | Tech Stack | Benefit |
|---|---|---|
| Vision Encoder | SigLIP-384 | Superior to CLIP. Understands fine-grained details and text in images better. |
| Language Core | Microsoft Phi-2 | A 2.7B reasoning powerhouse. Matches much larger models on mathematical and logical reasoning. |
| Connector | MLP Projection | Efficiently translates visual features into language tokens. |
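The connector row above can be made concrete with a toy sketch: patch features from the vision encoder are pushed through a two-layer MLP into the language model's embedding width. The dimensions here are assumptions based on SigLIP-SO400M/384 (729 patch tokens of width 1152) and Phi-2 (hidden size 2560), and the random weights are purely illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed dimensions: SigLIP-SO400M/384 patch tokens -> Phi-2 hidden size.
N_PATCHES, D_VISION, D_LANG = 729, 1152, 2560

# Two-layer MLP projector (LLaVA-1.5-style connector), random toy weights.
W1 = rng.standard_normal((D_VISION, D_LANG)) * 0.02
b1 = np.zeros(D_LANG)
W2 = rng.standard_normal((D_LANG, D_LANG)) * 0.02
b2 = np.zeros(D_LANG)

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def project(vision_features):
    """Map vision-encoder patch features into the LM embedding space."""
    return gelu(vision_features @ W1 + b1) @ W2 + b2

patches = rng.standard_normal((N_PATCHES, D_VISION))
tokens = project(patches)
print(tokens.shape)  # (729, 2560): one language-model token per image patch
```

The design point is that the projector is deliberately cheap: all of the heavy lifting happens in the frozen-or-finetuned encoder and LM, while the MLP just reshapes the feature space.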
Troubleshooting (macOS): if PyAudio fails to install or no microphone is detected, make sure portaudio is installed via Homebrew first.

We welcome Stark Industries engineers! Please read CONTRIBUTING.md.
MIT License. Built for the future of AI interaction.
"Sometimes you gotta run before you can walk."