
MiniCPM-o 4.5 PyTorch Simple Demo System

Chinese Introduction | Detailed Documentation

Ready-to-use Demo Website

This demo system is officially provided by the MiniCPM-o 4.5 model training team. It pairs a PyTorch + CUDA inference backend with a lightweight frontend/backend design, aiming to demonstrate the complete audio-video omnimodal full-duplex capabilities of MiniCPM-o 4.5 in a transparent, concise, and lossless manner.

| Mode | Features | I/O Modalities | Paradigm |
| --- | --- | --- | --- |
| Turn-based Chat | Low-latency streaming interaction; requires a button press or VAD (Voice Activity Detection) to trigger responses; high response accuracy; strong basic capabilities | Audio + Text input, Audio + Text output | Turn-based |
| Omnimodal Full-Duplex | Real-time omnimodal full-duplex interaction; visual and voice input with simultaneous voice output; the model autonomously decides when to speak; powerful cutting-edge capabilities | Vision + Audio input, Text + Voice output | Full-duplex |
| Audio Full-Duplex | Real-time audio full-duplex interaction; voice input and voice output happen simultaneously; the model autonomously decides when to speak; powerful cutting-edge capabilities | Audio input, Text + Voice output | Full-duplex |

The three currently supported modes share a single model instance and can be hot-switched in well under a millisecond (< 0.1 ms). More modes will be supported soon.
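
Conceptually, the switch is cheap because all three modes reuse the same resident model and only a small amount of per-mode state changes hands. The sketch below illustrates this idea; the class and method names are our own invention, not the repository's actual API.

# Illustrative sketch only: ModeRouter and its methods are hypothetical names,
# not the repository's actual API.
class ModeRouter:
    """One shared model; switching modes only swaps a lightweight processor."""

    def __init__(self, model, processors):
        self.model = model            # single resident model instance (stays on GPU)
        self.processors = processors  # e.g. {"chat": ..., "omni_duplex": ..., "audio_duplex": ...}
        self.active = "chat"

    def switch(self, mode: str) -> None:
        # An O(1) pointer swap: no weights are loaded or moved, which is why
        # switching completes in well under a millisecond.
        if mode not in self.processors:
            raise ValueError(f"unknown mode: {mode}")
        self.active = mode

    def step(self, inputs):
        return self.processors[self.active].process(self.model, inputs)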

Additional features:

  • Customizable system prompts
  • Customizable reference audio
  • Simple, readable codebase for further development
  • Can serve as an API backend for third-party applications (see the client sketch below)
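
As a rough illustration of the API-backend use case, the snippet below sends a chat request to the gateway from Python. The endpoint path and payload fields are assumptions made for illustration; consult https://localhost:8006/docs for the actual schema.

# Hypothetical client sketch: the endpoint path and payload fields are
# assumptions for illustration, not the service's documented API.
import requests

GATEWAY = "https://localhost:8006"

resp = requests.post(
    f"{GATEWAY}/api/chat",    # hypothetical endpoint; see /docs for the real routes
    json={"text": "Hello!"},  # hypothetical payload shape
    verify=False,             # the demo serves a self-signed certificate
    timeout=300,              # matches the default request_timeout
)
resp.raise_for_status()
print(resp.json())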

Demo Preview

Architecture

Frontend (HTML/JS)
        | HTTPS / WSS
Gateway (:8006, HTTPS)
        | HTTP / WS (internal)
Worker Pool (:22400+)
  +-- Worker 0 (GPU 0)
  +-- Worker 1 (GPU 1)
  +-- ...
  • Frontend — Mode selection homepage, Turn-based Chat, Omni / Audio Duplex full-duplex interaction, Admin Dashboard
  • Gateway — Request routing and dispatching, WebSocket proxy, request queuing and session affinity
  • Worker — Each Worker exclusively occupies one GPU and supports the Turn-based Chat and Duplex protocols; Duplex sessions support pause/resume (auto-released on timeout). A dispatch sketch follows this list.
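
To make the Gateway's routing role concrete, here is a minimal sketch of round-robin dispatch with session affinity. It is a simplification with invented names, not the actual logic in gateway.py.

# Simplified sketch with invented names; not the actual logic in gateway.py.
import itertools

class Dispatcher:
    def __init__(self, workers):
        self.workers = workers              # e.g. ["localhost:22400", "localhost:22401"]
        self.round_robin = itertools.cycle(workers)
        self.affinity = {}                  # session_id -> worker address

    def pick(self, session_id: str) -> str:
        # A duplex session keeps model state on one Worker (and its GPU),
        # so the first assignment is remembered for the session's lifetime.
        if session_id not in self.affinity:
            self.affinity[session_id] = next(self.round_robin)
        return self.affinity[session_id]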

Quick Start

Check System Requirements

  1. Make sure you have an NVIDIA GPU with more than 28GB of VRAM.
  2. Make sure your machine is running a Linux operating system.

Install FFmpeg

FFmpeg is required for video frame extraction and inference result visualization. For more information, visit the official FFmpeg website.

macOS (Homebrew):

brew install ffmpeg

Ubuntu/Debian:

sudo apt update && sudo apt install ffmpeg

Verify installation:

ffmpeg -version
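
As an illustration of the frame extraction FFmpeg is used for, here is a minimal Python sketch that samples frames from a video via the ffmpeg CLI. The sampling rate and output layout are arbitrary choices; the demo's actual pipeline may differ.

# Minimal frame-extraction sketch; the demo's actual FFmpeg invocation may differ.
import os
import subprocess

def extract_frames(video_path: str, out_dir: str, fps: int = 1) -> None:
    """Sample `fps` frames per second from the video into numbered JPEGs."""
    os.makedirs(out_dir, exist_ok=True)
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", f"fps={fps}",                        # sample this many frames per second
         os.path.join(out_dir, "frame_%04d.jpg")],   # frame_0001.jpg, frame_0002.jpg, ...
        check=True,
    )

extract_frames("input.mp4", "frames", fps=2)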

Deployment Steps

1. Install Python 3.10

We recommend using miniconda to install Python 3.10.

mkdir -p ./miniconda3_install_tmp
# Download the miniconda3 installation script
wget https://repo.anaconda.com/miniconda/Miniconda3-py310_25.11.1-1-Linux-x86_64.sh -O ./miniconda3_install_tmp/miniconda.sh
# Install miniconda3 into the project directory
bash ./miniconda3_install_tmp/miniconda.sh -b -u -p ./miniconda3

After installation, you will have an empty base environment. Activate this base environment, which uses Python 3.10 by default.

source ./miniconda3/bin/activate
python --version  # Should display 3.10.x

2. Install Dependencies for MiniCPM-o 4.5

The fastest way is to use the install.sh script in the project directory. It creates a venv virtual environment named base under .venv in the project directory and installs all dependencies.

source ./miniconda3/bin/activate
bash ./install.sh

If you have a good network connection, the entire installation process takes about 5 minutes. If you are in China, consider using a third-party PyPI mirror such as the Tsinghua mirror.

Click to expand manual installation steps

You can also install dependencies manually in 2 steps:

# First, prepare an empty Python 3.10 environment
source ./miniconda3/bin/activate
python -m venv .venv/base
source .venv/base/bin/activate
# Install PyTorch
pip install "torch==2.8.0" "torchaudio==2.8.0"
# Install the remaining dependencies
pip install -r requirements.txt

3. Create Configuration File

Copy config.example.json to config.json in the project directory.

cp config.example.json config.json

The model path (model_path) defaults to openbmb/MiniCPM-o-4_5. If you have access to Hugging Face, no modification is needed — the model will be automatically pulled from Hugging Face.

Click to expand detailed instructions about model path

(Optional) If you prefer to download model weights to a fixed location, or cannot access Hugging Face, you can modify model_path to your local model path.

# Install huggingface cli
pip install -U huggingface_hub
# Download the model
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/your/MiniCPM-o-4_5

If you cannot access Hugging Face, you can download the model in advance using either of the following two methods:

  • Download the model using hf-mirror:

pip install -U huggingface_hub
export HF_ENDPOINT=https://hf-mirror.com
huggingface-cli download openbmb/MiniCPM-o-4_5 --local-dir /path/to/your/MiniCPM-o-4_5

  • Download the model using ModelScope:

pip install modelscope
modelscope download --model OpenBMB/MiniCPM-o-4_5 --local_dir /path/to/your/MiniCPM-o-4_5

Modify "gateway_port": 8006 to change the deployment port. The default is 8006.

4. Start the Service

CUDA_VISIBLE_DEVICES=0,1,2,3 bash start_all.sh

After the service starts, visit https://localhost:8006. The self-signed certificate will trigger a browser warning — click "Advanced" → "Proceed" to continue.

Click to expand detailed instructions about startup options

The following are advanced startup options, currently for developer reference.

CUDA_VISIBLE_DEVICES=0,1 bash start_all.sh   # Specify GPUs
bash start_all.sh --compile   # torch.compile acceleration (experimental, unstable)
bash start_all.sh --http      # Downgrade to HTTP (not recommended, mic/camera APIs require HTTPS)

Manual Startup (step by step):

# Worker (one per GPU)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python worker.py --worker-index 0 --gpu-id 0
# Gateway
PYTHONPATH=. .venv/base/bin/python gateway.py --port 10024 --workers localhost:22400

5. Stop the Service

pkill -f "gateway.py|worker.py"


Known Issues and Improvement Plans

  • In Turn-based Chat mode, image input is temporarily unavailable — only audio and text input are supported. An image Q&A mode will be split out soon.
  • Half-duplex voice call (no button required to trigger responses) is under development and will be merged soon.
  • In Audio Full-Duplex mode, echo cancellation currently has issues affecting interruption success rate. Using headphones is recommended. A fix is coming soon.
  • In voice mode, due to the model's training strategy, Chinese and English calls require a system prompt in the corresponding language.

Project Structure

Project Code Structure

minicpmo45_service/
├── config.json           # Service config (copied from config.example.json, gitignored)
├── config.example.json   # Config example (full fields + defaults)
├── config.py             # Config loading logic (Pydantic definition + JSON loading)
├── requirements.txt      # Python dependencies
├── start_all.sh          # One-click startup script
│
├── gateway.py            # Gateway (routing, queuing, WS proxy)
├── worker.py             # Worker (inference service)
├── gateway_modules/      # Gateway business modules
│   ├── core/             # Core encapsulation
│   ├── schemas/          # Pydantic schemas (request/response)
│   └── processors/       # Inference processors (UnifiedProcessor)
│
├── MiniCPMO45/           # Model core inference code
├── static/               # Frontend pages
├── resources/            # Resource files (reference audio, etc.)
├── tests/                # Tests
└── tmp/                  # Runtime logs and PID files

Frontend Routes

| Page | URL |
| --- | --- |
| Non-streaming | https://localhost:8006 |
| Omnimodal Full-Duplex | https://localhost:8006/omni |
| Audio Full-Duplex | https://localhost:8006/audio_duplex |
| Dashboard | https://localhost:8006/admin |
| API Docs | https://localhost:8006/docs |


Configuration

config.json — Unified Configuration File

All configurations are centralized in config.json (copied from config.example.json). config.json is gitignored and will not be committed.

Configuration Priority: CLI arguments > config.json > Pydantic defaults
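
One way to picture this chain is as layered dictionary overrides. The sketch below shows the idea with Pydantic; the field selection and function names are illustrative, not the actual config.py.

# Illustrative priority chain (CLI > config.json > Pydantic defaults);
# field selection and names are abridged, not the actual config.py.
import json
from pydantic import BaseModel

class ServiceConfig(BaseModel):
    gateway_port: int = 8006        # Pydantic default: lowest priority
    worker_base_port: int = 22400

def load_service_config(cli_overrides: dict) -> ServiceConfig:
    with open("config.json") as f:
        file_values = json.load(f).get("service", {})          # middle priority
    cli_values = {k: v for k, v in cli_overrides.items() if v is not None}
    return ServiceConfig(**{**file_values, **cli_values})      # CLI wins

cfg = load_service_config({"gateway_port": 10025, "worker_base_port": None})
print(cfg.gateway_port)  # 10025: the CLI override beats config.json and the default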

| Group | Field | Default | Description |
| --- | --- | --- | --- |
| model | model_path | (required) | HuggingFace-format model directory |
| model | pt_path | null | Additional .pt weight override |
| model | attn_implementation | "auto" | Attention implementation: "auto" / "flash_attention_2" / "sdpa" / "eager" |
| audio | ref_audio_path | assets/ref_audio/ref_minicpm_signature.wav | Default TTS reference audio |
| audio | playback_delay_ms | 200 | Frontend audio playback delay (ms); higher = smoother but more latency |
| audio | chat_vocoder | "token2wav" | Chat mode vocoder: "token2wav" (default) or "cosyvoice2" |
| service | gateway_port | 8006 | Gateway port |
| service | worker_base_port | 22400 | Worker base port |
| service | max_queue_size | 100 | Maximum queued requests |
| service | request_timeout | 300.0 | Request timeout (seconds) |
| service | compile | false | torch.compile acceleration |
| service | data_dir | "data" | Data directory |
| duplex | pause_timeout | 60.0 | Duplex pause timeout (seconds) |

Minimal Configuration (only model path required):

{"model": {"model_path": "/path/to/model"}}

CLI Argument Overrides

# Worker
python worker.py --model-path /alt/model --pt-path /alt/weights.pt --ref-audio-path /alt/ref.wav --compile
# Gateway
python gateway.py --port 10025 --workers localhost:22400,localhost:22401 --http

Resource Consumption

| Resource | Token2Wav (default) |
| --- | --- |
| VRAM (per Worker, after initialization) | ~21.5 GB |
| Model loading time | ~16s |
| Mode switching latency | < 0.1 ms |

Compile mode incurs an additional ~60s compilation time on the first inference.

Testing

# Schema unit tests (no GPU required)
PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_schemas.py -v
# Processor tests (GPU required)
CUDA_VISIBLE_DEVICES=0 PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_chat.py tests/test_streaming.py tests/test_duplex.py -v -s
# API integration tests (service must be running)
PYTHONPATH=. .venv/base/bin/python -m pytest tests/test_api.py -v -s
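
For a sense of what the GPU-free schema tests look like, here is an illustrative sketch; the schema name and fields are invented and do not match tests/test_schemas.py.

# Illustrative schema test; the schema name and fields are invented and do not
# match gateway_modules/schemas or tests/test_schemas.py.
import pytest
from pydantic import BaseModel, ValidationError

class ChatRequest(BaseModel):   # stand-in for a real request schema
    text: str
    max_new_tokens: int = 256

def test_defaults_applied():
    assert ChatRequest(text="hi").max_new_tokens == 256

def test_missing_required_field_rejected():
    with pytest.raises(ValidationError):
        ChatRequest()  # "text" is required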

About

https://github.com/OpenBMB/minicpm-o-4_5-pytorch-simple-demo
