VoxCPM2 is a tokenizer-free, diffusion autoregressive Text-to-Speech model — 2B parameters, 30 languages, 48kHz audio output, trained on over 2 million hours of multilingual speech data.
Arabic, Burmese, Chinese, Danish, Dutch, English, Finnish, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Khmer, Korean, Lao, Malay, Norwegian, Polish, Portuguese, Russian, Spanish, Swahili, Swedish, Tagalog, Thai, Turkish, Vietnamese
Chinese Dialects: Sichuanese (四川话), Cantonese (粤语), Wu (吴语), Northeastern Mandarin (东北话), Henan (河南话), Shaanxi (陕西话), Shandong (山东话), Tianjin (天津话), Hokkien (闽南话)
```bash
pip install voxcpm
```
Requirements: Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0 · Full Quick Start →
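As a quick sanity check before installing, you can verify the environment meets the requirements above. A minimal sketch (the version thresholds come from the requirements line; `check_python` is a hypothetical helper, not part of the voxcpm package):

```python
import sys

def check_python(min_version=(3, 10)):
    """VoxCPM2 requires Python >= 3.10."""
    return sys.version_info[:2] >= min_version

print("Python OK:", check_python())

try:
    import torch
    # PyTorch >= 2.5.0 and a CUDA >= 12.0 runtime are required for GPU inference.
    print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
except ImportError:
    print("PyTorch not installed yet; it is pulled in by `pip install voxcpm`.")
```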
```python
from voxcpm import VoxCPM
import soundfile as sf

model = VoxCPM.from_pretrained("openbmb/VoxCPM2", load_denoiser=False)

wav = model.generate(
    text="VoxCPM2 brings multilingual support, creative voice design, and controllable voice cloning.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("output.wav", wav, model.tts_model.sample_rate)
```
Put the voice description in parentheses at the start of text, followed by the content to synthesize:
```python
wav = model.generate(
    text="(A young woman, gentle and sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
```
```python
# Basic cloning
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="speaker.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Cloning with style control
wav = model.generate(
    text="(slightly faster, cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="speaker.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)
```
Provide both the reference audio and its exact transcript for maximum fidelity. Pass the same clip to both reference_wav_path and prompt_wav_path for highest similarity:
```python
wav = model.generate(
    text="This is an ultimate cloning demonstration using VoxCPM2.",
    prompt_wav_path="speaker_reference.wav",
    prompt_text="The transcript of the reference audio.",
    reference_wav_path="speaker_reference.wav",
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
```
```python
import numpy as np

chunks = []
for chunk in model.generate_streaming(text="Streaming is easy with VoxCPM!"):
    chunks.append(chunk)

wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
```
| Property | Value |
|---|---|
| Architecture | Tokenizer-free Diffusion Autoregressive (LocEnc → TSLM → RALM → LocDiT) |
| Backbone | Based on MiniCPM-4, 2B parameters total |
| Audio VAE | AudioVAE V2 (asymmetric encode/decode, 16kHz in → 48kHz out) |
| Training Data | 2M+ hours multilingual speech |
| LM Token Rate | 6.25 Hz |
| Max Sequence Length | 8192 tokens |
| dtype | bfloat16 |
| VRAM | ~8 GB |
| RTF (RTX 4090) | ~0.30 (standard) / ~0.13 (Nano-vLLM) |
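The token-rate and sequence-length rows above imply a maximum clip length per sequence, and the RTF rows translate directly into synthesis time (RTF = wall-clock synthesis time / audio duration). A back-of-envelope sketch using plain arithmetic, not a voxcpm API:

```python
TOKEN_RATE_HZ = 6.25   # LM token rate from the table
MAX_TOKENS = 8192      # max sequence length from the table

# Longest audio a single sequence can represent:
max_audio_s = MAX_TOKENS / TOKEN_RATE_HZ   # 1310.72 s, roughly 21.8 minutes

# Synthesis time for 60 s of audio at the quoted RTFs (RTX 4090):
t_standard = 60.0 * 0.30   # ~18 s with the standard pipeline
t_nanovllm = 60.0 * 0.13   # ~7.8 s with Nano-vLLM

print(max_audio_s, t_standard, t_nanovllm)
```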
VoxCPM2 achieves state-of-the-art or competitive results on major zero-shot and controllable TTS benchmarks.
See the GitHub repo for full benchmark tables (Seed-TTS-eval, CV3-eval, InstructTTSEval, MiniMax Multilingual Test).
VoxCPM2 supports both full SFT and LoRA fine-tuning with as little as 5–10 minutes of audio:
```bash
# LoRA fine-tuning (recommended)
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full fine-tuning
python scripts/train_voxcpm_finetune.py \
    --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml
```
See the Fine-tuning Guide for full instructions.
```bibtex
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
```
Released under the Apache-2.0 license, free for commercial use. For production deployments, we recommend thorough testing and safety evaluation tailored to your use case.