👋 Join the community to discuss and share ideas!
Feishu group | Discord
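VoxCPM is a tokenizer-free text-to-speech (TTS) system. Its end-to-end diffusion autoregressive architecture generates continuous speech representations directly, bypassing discrete audio tokenization, to produce highly natural and expressive speech.
VoxCPM2 is the latest release. Built on a MiniCPM-4 backbone with 2B parameters in total and trained on more than 2 million hours of multilingual audio, it supports 30 languages plus 9 Chinese dialects, voice design, and controllable voice cloning, and natively outputs high-quality 48kHz audio.
Chinese dialects: Sichuanese, Cantonese, Wu, Northeastern Mandarin, Henan, Shaanxi, Shandong, Tianjin, and Minnan (Southern Min)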
pip install voxcpm
Requirements: Python ≥ 3.10, PyTorch ≥ 2.5.0, CUDA ≥ 12.0. See the Quick Start documentation for details.
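The version requirements above can be checked before installing. The following is a minimal sketch, not part of the voxcpm package: `meets_minimum` is a hypothetical helper that compares dotted version strings and ignores local build suffixes such as `+cu121`.

```python
import re
import sys


def meets_minimum(version: str, minimum: str) -> bool:
    """Return True if a dotted version string is >= the minimum.

    Hypothetical helper (not part of voxcpm). A local build suffix such as
    "+cu121" is dropped and only the first three numeric parts are compared.
    """
    def parse(v: str) -> tuple:
        core = v.split("+", 1)[0]
        nums = [int(n) for n in re.findall(r"\d+", core)[:3]]
        nums += [0] * (3 - len(nums))  # pad "2.5" to (2, 5, 0)
        return tuple(nums)

    return parse(version) >= parse(minimum)


python_version = ".".join(str(n) for n in sys.version_info[:3])
print("Python OK:", meets_minimum(python_version, "3.10"))
print(meets_minimum("2.5.1+cu121", "2.5.0"))  # True
```

If PyTorch is already installed, passing `torch.__version__` as the first argument checks the PyTorch requirement the same way.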
from voxcpm import VoxCPM
import soundfile as sf

# Load the pretrained VoxCPM2 weights without the optional denoiser.
model = VoxCPM.from_pretrained(
    "openbmb/VoxCPM2",
    load_denoiser=False,
)

wav = model.generate(
    text="VoxCPM2 is the currently recommended release for multilingual speech synthesis.",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("demo.wav", wav, model.tts_model.sample_rate)
print("Saved: demo.wav")
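Create a brand-new voice from a natural-language description, with no reference audio required. Format: put the voice description in parentheses at the start of text, e.g. "(voice description)Text to synthesize.":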
wav = model.generate(
    text="(A young woman with a gentle, sweet voice)Hello, welcome to VoxCPM2!",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("voice_design.wav", wav, model.tts_model.sample_rate)
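The "(voice description)text" format can be assembled with a small helper so that the description and the text to synthesize stay separate in calling code. This is only a sketch: `with_voice_description` is a hypothetical name, not a function provided by voxcpm.

```python
def with_voice_description(description: str, text: str) -> str:
    """Prefix `text` with a parenthesized voice description.

    Hypothetical helper, not part of the voxcpm package. It only builds the
    "(description)text" string; an empty description leaves the text as-is.
    """
    description = description.strip()
    if not description:
        return text
    return f"({description}){text}"


prompt = with_voice_description(
    "A young woman with a gentle, sweet voice",
    "Hello, welcome to VoxCPM2!",
)
print(prompt)
# (A young woman with a gentle, sweet voice)Hello, welcome to VoxCPM2!
```

The resulting string would then be passed as the `text` argument of `model.generate`.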
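Provide a reference audio clip and the model clones its timbre. Control instructions can additionally adjust speaking rate, emotion, or style.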
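# Basic cloning: reference audio only
wav = model.generate(
    text="This is a cloned voice generated by VoxCPM2.",
    reference_wav_path="path/to/voice.wav",
)
sf.write("clone.wav", wav, model.tts_model.sample_rate)

# Controllable cloning: prefix the text with a parenthesized control instruction
wav = model.generate(
    text="(A bit faster, in a cheerful tone)This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    cfg_value=2.0,
    inference_timesteps=10,
)
sf.write("controllable_clone.wav", wav, model.tts_model.sample_rate)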
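Provide the reference audio together with its exact transcript for high-fidelity cloning based on audio continuation. For the highest cloning similarity, pass the same audio to both reference_wav_path and prompt_wav_path: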
wav = model.generate(
    text="This is an ultimate cloning demo using VoxCPM2.",
    prompt_wav_path="path/to/voice.wav",
    prompt_text="Transcript of the reference audio.",
    reference_wav_path="path/to/voice.wav",  # optional, improves similarity
)
sf.write("hifi_clone.wav", wav, model.tts_model.sample_rate)
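The cloning modes shown above differ only in which keyword arguments are passed to `model.generate`. The sketch below collects them in one place. `clone_kwargs` is a hypothetical helper, and the rule that `prompt_wav_path` and `prompt_text` must be supplied together is an assumption drawn from the description "reference audio together with its exact transcript". The examples above show a control instruction only together with `reference_wav_path`.

```python
from typing import Optional


def clone_kwargs(
    text: str,
    reference_wav_path: Optional[str] = None,
    prompt_wav_path: Optional[str] = None,
    prompt_text: Optional[str] = None,
    control: Optional[str] = None,
) -> dict:
    """Build keyword arguments for model.generate() in a cloning mode.

    Hypothetical helper, not part of voxcpm. It uses only parameter names
    that appear in the examples above.
    """
    # Assumption: continuation-style cloning needs both the audio and its transcript.
    if (prompt_wav_path is None) != (prompt_text is None):
        raise ValueError("prompt_wav_path and prompt_text must be given together")
    if reference_wav_path is None and prompt_wav_path is None:
        raise ValueError("cloning needs reference_wav_path or prompt_wav_path")

    # A control instruction is written as a parenthesized prefix of the text.
    kwargs = {"text": f"({control}){text}" if control else text}
    if reference_wav_path is not None:
        kwargs["reference_wav_path"] = reference_wav_path
    if prompt_wav_path is not None:
        kwargs["prompt_wav_path"] = prompt_wav_path
        kwargs["prompt_text"] = prompt_text
    return kwargs


kwargs = clone_kwargs(
    "This is a cloned voice with style control.",
    reference_wav_path="path/to/voice.wav",
    control="A bit faster, in a cheerful tone",
)
# wav = model.generate(**kwargs)  # requires a loaded VoxCPM model
```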
import numpy as np

chunks = []
for chunk in model.generate_streaming(
    text="Streaming speech synthesis with VoxCPM is easy!",
):
    chunks.append(chunk)  # collect each chunk as it is generated
wav = np.concatenate(chunks)
sf.write("streaming.wav", wav, model.tts_model.sample_rate)
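For long texts it may be preferable to write each chunk to disk as it arrives instead of concatenating everything in memory. The following is a sketch using only the standard library. It assumes each streamed chunk is a 1-D sequence of mono float samples in [-1.0, 1.0], which the source does not state explicitly, and `write_chunks_to_wav` is a hypothetical helper.

```python
import struct
import wave
from typing import Iterable, Sequence


def write_chunks_to_wav(
    chunks: Iterable[Sequence[float]], path: str, sample_rate: int
) -> int:
    """Write streamed mono float chunks to a 16-bit PCM WAV file one by one.

    Assumes each chunk is a 1-D sequence of floats in [-1.0, 1.0], for
    example a NumPy array. Returns the number of samples written.
    """
    total = 0
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)  # 16-bit samples
        wav_file.setframerate(sample_rate)
        for chunk in chunks:
            # Clip to [-1, 1] and scale to the signed 16-bit range.
            pcm = [int(max(-1.0, min(1.0, float(x))) * 32767) for x in chunk]
            wav_file.writeframes(struct.pack(f"<{len(pcm)}h", *pcm))
            total += len(pcm)
    return total


# Usage with a loaded model (not run here):
# write_chunks_to_wav(
#     model.generate_streaming(text="..."),
#     "streaming.wav",
#     model.tts_model.sample_rate,
# )
```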
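# Voice design (no reference audio needed)
voxcpm design \
  --text "VoxCPM2 brings a brand-new speech synthesis experience." \
  --output out.wav

# Voice design with a style-control description
voxcpm design \
  --text "VoxCPM2 brings a brand-new speech synthesis experience." \
  --control "Young female voice, warm and gentle, with a slight smile" \
  --output out.wav

# Voice cloning (reference audio)
voxcpm clone \
  --text "This is a voice cloning demo." \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Ultimate cloning (prompt audio + transcript)
voxcpm clone \
  --text "This is a voice cloning demo." \
  --prompt-audio path/to/voice.wav \
  --prompt-text "Transcript of the reference audio" \
  --reference-audio path/to/voice.wav \
  --output out.wav

# Batch processing
voxcpm batch --input examples/input.txt --output-dir outs

# Help
voxcpm --help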
python app.py  # then open http://localhost:7860
For high-throughput deployment, use Nano-vLLM-VoxCPM, a dedicated inference engine built on Nano-vLLM that supports concurrent requests and an async API.
pip install nano-vllm-voxcpm
from nanovllm_voxcpm import VoxCPM
import numpy as np, soundfile as sf

# Start the inference server on GPU 0 from a local model directory.
server = VoxCPM.from_pretrained(model="/path/to/VoxCPM", devices=[0])
chunks = list(server.generate(target_text="Hello, I am from VoxCPM!"))
sf.write("out.wav", np.concatenate(chunks), 48000)  # 48 kHz output
server.stop()
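On an NVIDIA RTX 4090, the real-time factor (RTF) is as low as about 0.13, compared with about 0.3 for the standard PyTorch implementation. Batched concurrent requests and a FastAPI HTTP server are supported. See the Nano-vLLM-VoxCPM repository for details.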
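RTF is the ratio of synthesis time to the duration of the generated audio, so values below 1 mean faster than real time. As a small worked example, the sketch below converts the two figures quoted above into an estimated wall-clock time for a 60-second clip. The helper names and the 60-second clip are illustrative, and actual timings depend on hardware and settings.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent generating / duration of the generated audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return synthesis_seconds / audio_seconds


def synthesis_time(rtf: float, audio_seconds: float) -> float:
    """Estimated wall-clock time needed to generate `audio_seconds` of speech."""
    return rtf * audio_seconds


# RTX 4090 figures quoted above, applied to a 60-second clip.
for name, rtf in [("Nano-vLLM-VoxCPM", 0.13), ("standard PyTorch", 0.30)]:
    print(f"{name}: ~{synthesis_time(rtf, 60.0):.1f} s")
# Nano-vLLM-VoxCPM: ~7.8 s
# standard PyTorch: ~18.0 s
```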
| | VoxCPM2 | VoxCPM1.5 | VoxCPM-0.5B |
|---|---|---|---|
| Status | 🟢 Latest | Stable | Legacy |
| Backbone parameters | 2B | 0.6B | 0.5B |
| Audio sample rate | 48kHz | 44.1kHz | 16kHz |
| LM token rate | 6.25Hz | 6.25Hz | 12.5Hz |
| Number of languages | 30 | 2 (Chinese, English) | 2 (Chinese, English) |
| Cloning modes | Isolated reference audio (no transcript needed) & audio continuation | Audio continuation only | Audio continuation only |
| Voice design | ✅ | - | - |
| Controllable voice cloning | ✅ | - | - |
| SFT / LoRA | ✅ | ✅ | ✅ |
| RTF (RTX 4090) | ~0.30 | ~0.15 | ~0.17 |
| RTF Nano-vLLM (RTX 4090) | ~0.13 | ~0.08 | ~0.10 |
| VRAM usage | ~8 GB | ~6 GB | ~5 GB |
| Model weights | 🤗 HF / MS | 🤗 HF / MS | 🤗 HF / MS |
| Technical report | Coming soon | - | arXiv ICLR 2026 |
| Demo page | Audio samples | - | Audio samples |
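To make the sample-rate and LM-rate rows concrete, the arithmetic below works out how much audio one language-model step covers. It assumes the LM rate means the number of LM steps per second of generated audio, which the table does not spell out.

```python
def audio_per_lm_step(sample_rate_hz: float, token_rate_hz: float) -> tuple:
    """Return (milliseconds, output samples) covered by one LM step."""
    seconds = 1.0 / token_rate_hz
    return seconds * 1000.0, sample_rate_hz * seconds


# (sample rate in Hz, LM rate in Hz) taken from the table above
models = {
    "VoxCPM2": (48_000, 6.25),
    "VoxCPM1.5": (44_100, 6.25),
    "VoxCPM-0.5B": (16_000, 12.5),
}
for name, (sample_rate, token_rate) in models.items():
    ms, samples = audio_per_lm_step(sample_rate, token_rate)
    print(f"{name}: {ms:.0f} ms, {samples:.0f} samples per LM step")
# VoxCPM2: 160 ms, 7680 samples per LM step
# VoxCPM1.5: 160 ms, 7056 samples per LM step
# VoxCPM-0.5B: 80 ms, 1280 samples per LM step
```

Under this reading, halving the LM rate from 12.5Hz to 6.25Hz halves the number of autoregressive steps needed per second of speech.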
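VoxCPM2 uses continuous audio representations and a diffusion autoregressive paradigm. Operating in the continuous latent space of AudioVAE, the model runs a four-stage pipeline, LocEnc → TSLM → RALM → LocDiT, to deliver expressive speech synthesis with native 48kHz audio output.
For full architecture details, what changed in VoxCPM2, and the model comparison table, see the architecture design documentation.
VoxCPM2 achieves state-of-the-art or comparable results on public zero-shot and controllable TTS benchmarks.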
| Model | Parameters | Open-Source | test-EN WER/%⬇ | test-EN SIM/%⬆ | test-ZH CER/%⬇ | test-ZH SIM/%⬆ | test-Hard CER/%⬇ | test-Hard SIM/%⬆ |
|---|---|---|---|---|---|---|---|---|
| MegaTTS3 | 0.5B | ❌ | 2.79 | 77.1 | 1.52 | 79.0 | - | - |
| DiTAR | 0.6B | ❌ | 1.69 | 73.5 | 1.02 | 75.3 | - | - |
| CosyVoice3 | 0.5B | ❌ | 2.02 | 71.8 | 1.16 | 78.0 | 6.08 | 75.8 |
| CosyVoice3 | 1.5B | ❌ | 2.22 | 72.0 | 1.12 | 78.1 | 5.83 | 75.8 |
| Seed-TTS | - | ❌ | 2.25 | 76.2 | 1.12 | 79.6 | 7.59 | 77.6 |
| MiniMax-Speech | - | ❌ | 1.65 | 69.2 | 0.83 | 78.3 | - | - |
| F5-TTS | 0.3B | ✅ | 2.00 | 67.0 | 1.53 | 76.0 | 8.67 | 71.3 |
| MaskGCT | 1B | ✅ | 2.62 | 71.7 | 2.27 | 77.4 | - | - |
| CosyVoice | 0.3B | ✅ | 4.29 | 60.9 | 3.63 | 72.3 | 11.75 | 70.9 |
| CosyVoice2 | 0.5B | ✅ | 3.09 | 65.9 | 1.38 | 75.7 | 6.83 | 72.4 |
| SparkTTS | 0.5B | ✅ | 3.14 | 57.3 | 1.54 | 66.0 | - | - |
| FireRedTTS | 0.5B | ✅ | 3.82 | 46.0 | 1.51 | 63.5 | 17.45 | 62.1 |
| FireRedTTS-2 | 1.5B | ✅ | 1.95 | 66.5 | 1.14 | 73.6 | - | - |
| Qwen2.5-Omni | 7B | ✅ | 2.72 | 63.2 | 1.70 | 75.2 | 7.97 | 74.7 |
| Qwen3-Omni | 30B-A3B | ✅ | 1.39 | - | 1.07 | - | - | - |
| OpenAudio-s1-mini | 0.5B | ✅ | 1.94 | 55.0 | 1.18 | 68.5 | 23.37 | 64.3 |
| IndexTTS2 | 1.5B | ✅ | 2.23 | 70.6 | 1.03 | 76.5 | 7.12 | 75.5 |
| VibeVoice | 1.5B | ✅ | 3.04 | 68.9 | 1.16 | 74.4 | - | - |
| HiggsAudio-v2 | 3B | ✅ | 2.44 | 67.7 | 1.50 | 74.0 | 55.07 | 65.6 |
| VoxCPM-0.5B | 0.6B | ✅ | 1.85 | 72.9 | 0.93 | 77.2 | 8.87 | 73.0 |
| VoxCPM1.5 | 0.8B | ✅ | 2.12 | 71.4 | 1.18 | 77.0 | 7.74 | 73.1 |
| MOSS-TTS | | ✅ | 1.85 | 73.4 | 1.20 | 78.8 | - | - |
| Qwen3-TTS | 1.7B | ✅ | 1.23 | 71.7 | 1.22 | 77.0 | 6.76 | 74.8 |
| FishAudio S2 | 4B | ✅ | 0.99 | - | 0.54 | - | 5.99 | - |
| LongCat-Audio-DiT | 3.5B | ✅ | 1.50 | 78.6 | 1.09 | 81.8 | 6.04 | 79.7 |
| VoxCPM2 | 2B | ✅ | 1.84 | 75.3 | 0.97 | 79.5 | 8.13 | 75.3 |
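For readers unfamiliar with the error-rate columns, WER and CER are edit-distance ratios between a reference text and a transcript of the synthesized audio. The sketch below shows the standard definitions only. The benchmark's ASR model and text normalization are not described here, so this does not reproduce the numbers in the table.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting substitutions, insertions, deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent: word-level edits / reference word count."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    ref_chars = reference.replace(" ", "")
    hyp_chars = hypothesis.replace(" ", "")
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)


# One substitution and one deletion against six reference words.
print(round(wer("the cat sat on the mat", "the cat sit on mat"), 2))  # 33.33
# One wrong character out of four.
print(cer("你好世界", "你好视界"))  # 25.0
```

Both functions assume a non-empty reference. CER is typically used for Chinese because the text is not whitespace-segmented into words.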
| Model | zh | en | hard-zh | hard-en | ja | ko | de | es | fr | it | ru |
|---|---|---|---|---|---|---|---|---|---|---|---|
| CosyVoice2 | 4.08 | 6.32 | 12.58 | 11.96 | 9.13 | 19.7 | - | - | - | - | - |
| CosyVoice3-1.5B | 3.91 | 4.99 | 9.77 | 10.55 | 7.57 | 5.69 | 6.43 | 4.47 | 11.8 | 10.5 | 6.64 |
| Fish Audio S2 | 2.65 | 2.43 | 9.10 | 4.40 | 3.96 | 2.76 | 2.22 | 2.00 | 6.26 | 2.04 | 2.78 |
| VoxCPM2 | 3.65 | 5.00 | 8.55 | 8.48 | 5.96 | 5.69 | 4.77 | 3.80 | 9.85 | 4.25 | 5.21 |
| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 1.665 | 1.666 | – | 3.500 | 13.046 |
| Cantonese | 34.111 | 51.513 | – | 30.670 | 38.584 |
| Chinese | 2.252 | 16.026 | 0.928 | 0.730 | 1.136 |
| Czech | 3.875 | 2.108 | – | 2.840 | 24.132 |
| Dutch | 1.143 | 0.803 | – | 0.990 | 0.913 |
| English | 2.164 | 2.339 | 0.934 | 1.620 | 2.289 |
| Finnish | 4.666 | 2.964 | – | 3.330 | 2.632 |
| French | 4.099 | 5.216 | 2.858 | 3.050 | 4.534 |
| German | 1.906 | 0.572 | 1.235 | 0.550 | 0.679 |
| Greek | 2.016 | 0.991 | – | 5.740 | 2.844 |
| Hindi | 6.962 | 5.827 | – | 14.640 | 19.699 |
| Indonesian | 1.237 | 1.059 | – | 1.460 | 1.084 |
| Italian | 1.543 | 1.743 | 0.948 | 1.270 | 1.563 |
| Japanese | 3.519 | 10.646 | 3.823 | 2.760 | 4.628 |
| Korean | 1.747 | 1.865 | 1.755 | 1.180 | 1.962 |
| Polish | 1.415 | 0.766 | – | 1.260 | 1.141 |
| Portuguese | 1.877 | 1.331 | 1.526 | 1.140 | 1.938 |
| Romanian | 2.878 | 1.347 | – | 10.740 | 21.577 |
| Russian | 4.281 | 3.878 | 3.212 | 2.400 | 3.634 |
| Spanish | 1.029 | 1.084 | 1.126 | 0.910 | 1.438 |
| Thai | 2.701 | 73.936 | – | 4.230 | 2.961 |
| Turkish | 1.52 | 0.699 | – | 0.870 | 0.817 |
| Ukrainian | 1.082 | 0.997 | – | 2.300 | 6.316 |
| Vietnamese | 0.88 | 73.415 | – | 7.410 | 3.307 |
| Language | Minimax | ElevenLabs | Qwen3-TTS | FishAudio S2 | VoxCPM2 |
|---|---|---|---|---|---|
| Arabic | 73.6 | 70.6 | – | 75.0 | 79.1 |
| Cantonese | 77.8 | 67.0 | – | 80.5 | 83.5 |
| Chinese | 78.0 | 67.7 | 79.9 | 81.6 | 82.5 |
| Czech | 79.6 | 68.5 | – | 79.8 | 78.3 |
| Dutch | 73.8 | 68.0 | – | 73.0 | 80.8 |
| English | 75.6 | 61.3 | 77.5 | 79.7 | 85.4 |
| Finnish | 83.5 | 75.9 | – | 81.9 | 89.0 |
| French | 62.8 | 53.5 | 62.8 | 69.8 | 73.5 |
| German | 73.3 | 61.4 | 77.5 | 76.7 | 80.3 |
| Greek | 82.6 | 73.3 | – | 79.5 | 86.0 |
| Hindi | 81.8 | 73.0 | – | 82.1 | 85.6 |
| Indonesian | 72.9 | 66.0 | – | 76.3 | 80.0 |
| Italian | 69.9 | 57.9 | 81.7 | 74.7 | 78.0 |
| Japanese | 77.6 | 73.8 | 78.8 | 79.6 | 82.8 |
| Korean | 77.6 | 70.0 | 79.9 | 81.7 | 83.3 |
| Polish | 80.2 | 72.9 | – | 81.9 | 88.4 |
| Portuguese | 80.5 | 71.1 | 81.7 | 78.1 | 83.7 |
| Romanian | 80.9 | 69.9 | – | 73.3 | 79.7 |
| Russian | 76.1 | 67.6 | 79.2 | 79.0 | 81.1 |
| Spanish | 76.2 | 61.5 | 81.4 | 77.6 | 83.1 |
| Thai | 80.0 | 58.8 | – | 78.6 | 84.0 |
| Turkish | 77.9 | 59.6 | – | 83.5 | 87.1 |
| Ukrainian | 73.0 | 64.7 | – | 74.7 | 79.8 |
| Vietnamese | 74.3 | 36.9 | – | 74.0 | 80.6 |
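The values above appear to be speaker-similarity (SIM) scores in percent, although the table itself does not say how they are computed. Speaker similarity is commonly measured as the cosine similarity between speaker embeddings of the reference and the synthesized audio. The sketch below shows that computation on toy vectors. The embedding model is outside its scope, and treating SIM as cosine similarity times 100 is an assumption.

```python
import math
from typing import Sequence


def cosine_similarity_percent(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two embedding vectors, scaled to percent."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        raise ValueError("embeddings must be non-zero")
    return 100.0 * dot / norm


# Toy 3-dimensional "embeddings"; real speaker embeddings have many more dimensions.
reference_embedding = [0.2, 0.9, 0.4]
synthesized_embedding = [0.25, 0.8, 0.5]
print(round(cosine_similarity_percent(reference_embedding, synthesized_embedding), 1))
```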
| Model | InstructTTSEval-ZH APS⬆ | InstructTTSEval-ZH DSD⬆ | InstructTTSEval-ZH RP⬆ | InstructTTSEval-EN APS⬆ | InstructTTSEval-EN DSD⬆ | InstructTTSEval-EN RP⬆ |
|---|---|---|---|---|---|---|
| Hume | – | – | – | 83.0 | 75.3 | 54.3 |
| VoxInstruct | 47.5 | 52.3 | 42.6 | 54.9 | 57.0 | 39.3 |
| Parler-tts-mini | – | – | – | 63.4 | 48.7 | 28.6 |
| Parler-tts-large | – | – | – | 60.0 | 45.9 | 31.2 |
| PromptTTS | – | – | – | 64.3 | 47.2 | 31.4 |
| PromptStyle | – | – | – | 57.4 | 46.4 | 30.9 |
| VoiceSculptor | 75.7 | 64.7 | 61.5 | – | – | – |
| Mimo-Audio-7B-Instruct | 75.7 | 74.3 | 61.5 | 80.6 | 77.6 | 59.5 |
| Qwen3TTS-12Hz-1.7B-VD | 85.2 | 81.1 | 65.1 | 82.9 | 82.4 | 68.4 |
| VoxCPM2 | 85.2 | 71.5 | 60.8 | 84.2 | 83.2 | 71.4 |
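VoxCPM supports both full-parameter fine-tuning (SFT) and LoRA fine-tuning. As little as 5-10 minutes of audio is enough to adapt the model to a specific speaker, language, or domain.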
# LoRA fine-tuning (parameter-efficient, recommended)
python scripts/train_voxcpm_finetune.py \
  --config_path conf/voxcpm_v2/voxcpm_finetune_lora.yaml

# Full-parameter fine-tuning
python scripts/train_voxcpm_finetune.py \
  --config_path conf/voxcpm_v2/voxcpm_finetune_all.yaml

# WebUI for training and inference
python lora_ft_webui.py  # then open http://localhost:7860
Full guide → fine-tuning documentation (data preparation, configuration, training, LoRA hot-swapping, FAQ)
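Before fine-tuning, it can help to confirm that a dataset reaches the 5-10 minutes of audio mentioned above. The following is a sketch under stated assumptions: the clips are uncompressed PCM WAV files in one folder tree, and both the helper name and the folder path are hypothetical. The fine-tuning documentation describes the actual data format.

```python
import wave
from pathlib import Path


def total_wav_minutes(folder: str) -> float:
    """Sum the duration of all .wav files under `folder`, in minutes.

    Uses the standard-library `wave` module, so only uncompressed PCM WAV
    files can be read.
    """
    seconds = 0.0
    for path in sorted(Path(folder).rglob("*.wav")):
        with wave.open(str(path), "rb") as wav_file:
            seconds += wav_file.getnframes() / wav_file.getframerate()
    return seconds / 60.0


# Example (hypothetical folder):
# minutes = total_wav_minutes("data/my_speaker")
# print(f"{minutes:.1f} minutes of audio")
```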
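| Topic | Link |
|---|---|
| Quick start & installation | Quick Start |
| User guide & cookbook | User Guide |
| VoxCPM model family | Model List |
| Fine-tuning (SFT & LoRA) | Fine-tuning Guide |
| Frequently asked questions | FAQ |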
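| Project | Description |
|---|---|
| Nano-vLLM | High-throughput, fast GPU inference engine |
| VoxCPM.cpp | GGML/GGUF: CPU, CUDA, and Vulkan inference |
| VoxCPM-ONNX | ONNX export with CPU inference support |
| VoxCPMANE | Apple Neural Engine backend |
| voxcpm_rs | Rust reimplementation |
| ComfyUI-VoxCPM | ComfyUI node workflow |
| ComfyUI-VoxCPMTTS | ComfyUI TTS extension |
| TTS WebUI | Browser-based TTS extension |
See the documentation for the full ecosystem. Community projects are not officially maintained by OpenBMB. Built something interesting? Open an issue or PR to add it!
If VoxCPM helps you, please consider citing our work and starring the repository ⭐!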
@article{voxcpm2_2026,
  title   = {VoxCPM2: Tokenizer-Free TTS for Multilingual Speech Generation, Creative Voice Design, and True-to-Life Cloning},
  author  = {VoxCPM Team},
  journal = {GitHub},
  year    = {2026},
}

@article{voxcpm2025,
  title   = {VoxCPM: Tokenizer-Free TTS for Context-Aware Speech Generation and True-to-Life Voice Cloning},
  author  = {Zhou, Yixuan and Zeng, Guoyang and Liu, Xin and Li, Xiang and Yu, Renjie and Wang, Ziyang and Ye, Runchuan and Sun, Weiyue and Gui, Jiancheng and Li, Kehan and Wu, Zhiyong and Liu, Zhiyuan},
  journal = {arXiv preprint arXiv:2509.24650},
  year    = {2025},
}
VoxCPM model weights and code are open-sourced under the Apache-2.0 license.