🌐Project Page |🤗 Hugging Face| 🤖 ModelScope | 🎮 Gradio Demo
Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5Hz continuous tokenizer and Patch-by-Patch compression, it delivers competitive inference efficiency (3.1Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.
🚀 Core Capabilities
https://github.com/user-attachments/assets/eb0e900e-ed5e-40ca-98df-31c244939527
Compared with other audio-capable LLMs, Ming-omni-tts introduces the following key optimizations:
Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English).
Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz).
Audio metrics are evaluated on AudioCaps.
| Model | Institution | seed-tts-eval-zh WER ↓ | seed-tts-eval-zh SIM ↑ | seed-tts-eval-en WER ↓ | seed-tts-eval-en SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | BytedanceSpeech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | CUHK-Shenzhen | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | SJTU | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | – | 1.39 | – |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | – | 1.64 | – |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | – | 1.49 | – |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | – | 1.32 | – |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | – | 1.24 | – |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |
| Model | Institution | Speech rate success ↑ | Speech volume success ↑ | Speech F0 success ↑ | Avg. success ↑ | WER ↓ | SIM ↑ |
|---|---|---|---|---|---|---|---|
| CosyVoice3 | Alibaba | 100% | 97.67% | 65.33% | 87.67% | 1.21% | 0.58 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 97.67% | 95.00% | 91.33% | 94.67% | 0.27% | 0.712 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 96.33% | 97.00% | 83.67% | 92.33% | 0.347% | 0.776 |
Below is a comparison between Ming-omni-tts and other state-of-the-art (SOTA) models on the emotion control task.
| Model | Institution | Average ↑ | Happy (text-related) | Sad (text-related) | Angry (text-related) | Happy (text-unrelated) | Sad (text-unrelated) | Angry (text-unrelated) |
|---|---|---|---|---|---|---|---|---|
| F5-TTS | SJTU | 0.647 | 0.92 | 0.52 | 0.72 | 0.80 | 0.28 | 0.64 |
| Sparks-TTS | HKST | 0.553 | 0.80 | 0.56 | 0.50 | 0.50 | 0.60 | 0.36 |
| GPT-SoVits | – | 0.517 | 0.88 | 0.54 | 0.50 | 0.48 | 0.40 | 0.30 |
| CosyVoice2 | Alibaba | 0.587 | 0.84 | 0.72 | 0.58 | 0.56 | 0.44 | 0.38 |
| CosyVoice3-0.5B | Alibaba | 0.663 | 0.92 | 0.70 | 0.72 | 0.64 | 0.42 | 0.58 |
| CosyVoice3-1.5B | Alibaba | 0.630 | 0.86 | 0.64 | 0.72 | 0.64 | 0.44 | 0.48 |
| + DiffRO-EMO | Alibaba | 0.777 | 0.98 | 0.68 | 0.84 | 0.98 | 0.50 | 0.68 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.700 | 0.94 | 0.80 | 0.84 | 0.58 | 0.42 | 0.62 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.767 | 0.96 | 0.86 | 0.90 | 0.66 | 0.40 | 0.82 |
| Model | Institution | Average ↑ | Happy (text-related) | Sad (text-related) | Angry (text-related) | Happy (text-unrelated) | Sad (text-unrelated) | Angry (text-unrelated) |
|---|---|---|---|---|---|---|---|---|
| CosyVoice3-0.5B | Alibaba | 0.400 | 0.68 | 0.30 | 0.78 | 0.14 | 0.04 | 0.46 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.343 | 0.68 | 0.26 | 0.74 | 0.14 | 0.00 | 0.24 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.450 | 0.78 | 0.38 | 0.76 | 0.30 | 0.02 | 0.46 |
Results on WSC-Eval-TTS (easy / hard) and WSYue-TTS-eval (Base / Coverage):

| Model | Institution | Easy CER(%) ↓ | Easy SIM ↑ | Easy ACC(%) ↑ | Hard CER(%) ↓ | Hard SIM ↑ | Hard ACC(%) ↑ | Base CER(%) ↓ | Base SIM ↑ | Base ACC(%) ↑ | Coverage CER(%) ↓ | Coverage SIM ↑ | Coverage ACC(%) ↑ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Step-Audio-TTS | Step | 10.83 | 67.66 | – | 12.52 | 54.52 | – | 27.79 | 0.762 | – | 24.25 | 0.781 | – |
| CosyVoice 2.0 | Alibaba | 7.14 | 70.27 | – | 9.06 | 60.10 | – | 14.38 | 0.812 | – | 13.74 | 0.826 | – |
| Qwen-TTS | Alibaba | 4.13 | – | – | 7.35 | – | – | – | – | – | – | – | – |
| CosyVoice2-WSC | Alibaba | 4.28 | 72.78 | – | 8.78 | 62.59 | – | – | – | – | – | – | – |
| CosyVoice2-WSC-SFT | Alibaba | 4.08 | 78.84 | – | 7.22 | 67.96 | – | – | – | – | – | – | – |
| Llasa-1B | – | – | – | – | – | – | – | 53.31 | 0.732 | – | 43.68 | 0.754 | – |
| Llasa-1B-Yue | – | – | – | – | – | – | – | 10.89 | 0.762 | – | 12.78 | 0.772 | – |
| Edge-TTS | – | – | – | – | – | – | – | 8.30 | – | – | 9.27 | – | – |
| CosyVoice2-Yue | – | – | – | – | – | – | – | 10.33 | 0.821 | – | 9.49 | 0.834 | – |
| CosyVoice3 | Alibaba | 3.17 | 0.696 | 68.06 | 4.07 | 0.723 | 80.90 | 8.36 | 0.611 | 91.70 | 8.95 | 0.658 | 95.80 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 2.25 | 0.695 | 82.08 | 3.18 | 0.717 | 84.42 | 9.70 | 0.598 | 96.00 | 11.62 | 0.644 | 95.80 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 2.35 | 0.730 | 83.48 | 3.19 | 0.750 | 88.44 | 6.47 | 0.622 | 96.30 | 7.87 | 0.667 | 95.81 |
Results on ZipVoice-Dia-zh:

| Model | Institution | CER ↓ | cpSIM ↑ | UTMOS ↑ |
|---|---|---|---|---|
| ZipVoice-Dia | Xiaomi | 3.39% | 0.553 | 2.24 |
| MoonCast | Kimi | 27.43% | 0.441 | 1.76 |
| MOSS-TTSD | Fudan | 8.62% | 0.421 | 1.70 |
| Vibevoice-1.5B | Microsoft | 12.87% | 0.455 | 1.74 |
| FireRedTTS2 | Xiaohongshu | 3.34% | 0.512 | 1.90 |
| SoulX-Podcast | Soul | 2.20% | 0.599 | 2.09 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 2.12% | 0.457 | 2.25 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 1.84% | 0.470 | 2.19 |
Results on InstructTTSEval-ZH:

| Model | Institution | APS ↑ | DSD ↑ | RP ↑ | Average ↑ |
|---|---|---|---|---|---|
| Qwen3TTS-12Hz-1.7B-VD | Alibaba | 85.2 | 81.1 | 65.1 | 77.13 |
| Mimo-Audio-7B-Instruct | Xiaomi | 75.7 | 74.3 | 61.5 | 70.50 |
| VoiceSculptor | NPU | 75.7 | 64.7 | 61.5 | 67.30 |
| VoxInstruct | Tsinghua | 47.5 | 52.3 | 42.6 | 47.47 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 83.85 | 75.10 | 61.50 | 73.48 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 87.30 | 79.80 | 61.50 | 76.20 |
Results on Ming-BGM-Eval (Audiobox-Aesthetics: CE / CU / PC / PQ; SongEval: CO / MU / ME / CL / NA):

| Model | Institution | mulan_t | CE | CU | PC | PQ | Aesthetics Avg. | CO | MU | ME | CL | NA | SongEval Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doubao | Bytedance | 0.268 | 7.55 | 8.21 | 4.97 | 8.25 | 7.24 | 3.30 | 3.02 | 3.00 | 3.02 | 2.92 | 3.05 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 0.230 | 7.18 | 8.16 | 4.80 | 8.20 | 7.08 | 3.11 | 2.86 | 2.86 | 2.81 | 2.73 | 2.87 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 0.250 | 7.19 | 8.14 | 4.69 | 8.18 | 7.05 | 3.08 | 2.84 | 2.82 | 2.78 | 2.74 | 2.85 |
Results on AudioCaps:

| Model | Institution | FDopenl3 ↓ | KLpasst ↓ | CLAPscore ↑ |
|---|---|---|---|---|
| AudioLDM-large | University of Surrey | 108.300 | 1.810 | 0.419 |
| Stable Audio Open | Stability AI | 96.133 | 2.148 | 0.306 |
| TangoFlux | Singapore University of Technology and Design | 137.700 | 1.041 | 0.547 |
| TangoFlux_base | Singapore University of Technology and Design | 149.270 | 1.125 | 0.523 |
| Ming-omni-tta-0.5B(ours) | Ant Group | 53.384 | 1.172 | 0.504 |
| Ming-omni-tts-0.5B(ours) | Ant Group | 74.292 | 2.257 | 0.347 |
| Ming-omni-tts-16.8B-A3B(ours) | Ant Group | 65.918 | 1.640 | 0.424 |
Results on an internally constructed test set:

| Model | Institution | TN-area WER ↓ | Non-TN-area WER ↓ |
|---|---|---|---|
| Gemini-2.5 Pro | Google | 2.00% | 0.97% |
| Ming-omni-tts-0.5B(ours) | Ant Group | 1.97% | 0.85% |
You can download our latest models and benchmarks from both Hugging Face and ModelScope.
| Model | Download |
|---|---|
| Ming-omni-tts-tokenizer-12Hz | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-0.5B | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-16.8B-A3B | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tta-0.5B | 🤗 HuggingFace · 🤖 ModelScope |
If you are in mainland China, we strongly recommend downloading our models from 🤖 ModelScope.
```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-omni-tts-0.5B --local_dir inclusionAI/Ming-omni-tts-0.5B --revision master
```
Note: This download process will take several minutes to several hours, depending on your network conditions.
```shell
pip install -r requirements.txt
```
You can set up the environment using Docker in two ways.
```shell
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1

# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
```
```shell
# 1. Build the image (Docker image names must be lowercase)
docker build -t ming-omni-tts:v1.1 -f ./docker/ming_uniaudio.dockerfile .

# 2. Run the container
docker run -it --gpus all ming-omni-tts:v1.1 /bin/bash
```
```shell
git clone https://github.com/inclusionAI/MingTok-Audio.git
cd MingTok-Audio
python3 test.py
```
```shell
git clone https://github.com/inclusionAI/Ming-omni-tts.git
cd Ming-omni-tts
python3 cookbooks/test.py
```
Environment variables:
- `MODEL_PATH` (default: `inclusionAI/Ming-omni-tts-0.5B`)
- `DEVICE` (default: `cuda:0`)

For detailed usage, please refer to demo.ipynb.
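As an illustrative sketch (the actual scripts may read these variables differently), the two settings and their documented defaults can be consumed via `os.environ`:

```python
import os

# Hypothetical snippet showing how the MODEL_PATH / DEVICE environment
# variables could be read, with the defaults documented above.
MODEL_PATH = os.environ.get("MODEL_PATH", "inclusionAI/Ming-omni-tts-0.5B")
DEVICE = os.environ.get("DEVICE", "cuda:0")

print(f"loading {MODEL_PATH} on {DEVICE}")
```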
Note: We tested the examples on NVIDIA H800-80GB and H20-96GB GPUs with CUDA 12.4.
If you find our work helpful, please consider citing it.
This project is forked from InclusionAI's Ming-omni-tts and builds the following work on top of its model capabilities:
yscgreedy<20260303>:
```shell
git clone https://github.com/Yscgreedy/Ming-omni-tts.git
cd Ming-omni-tts
pip install -r requirements.txt
uvicorn service.app:app --host 0.0.0.0 --port 8000
# or
python ./service/app.py
```
Unified endpoints:

- `GET /healthz`
- `POST /seed`
- `POST /v1/stream`
- `POST /v1/generate`

Text generation request:
```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "text",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "化学反应方程式:\\ce{2H2 + O2 -> 2H2O}",
    "max_decode_steps": 200
  }'
```
Speech generation (stream wav):
```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "speech",
    "response_mode": "stream",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。",
    "use_spk_emb": true,
    "prompt_wav_path": "data/wavs/10002287-00000094.wav",
    "prompt_text": "在此奉劝大家别乱打美白针。",
    "max_decode_steps": 200,
    "cfg": 2.0,
    "sigma": 0.25,
    "temperature": 0.0
  }' --output output/generated.wav
```
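The same streaming request can also be issued from Python. Below is an illustrative stdlib-only client: the endpoint URL and payload fields mirror the curl example above, while `build_payload` and `stream_to_file` are helper names invented here, not part of the service.

```python
import json
import urllib.request

API_URL = "http://127.0.0.1:8000/v1/generate"  # assumes the local service above

def build_payload(text: str, prompt_wav_path: str, prompt_text: str) -> dict:
    # Field names and tuning values mirror the streaming curl example.
    return {
        "task_type": "speech",
        "response_mode": "stream",
        "prompt": "Please generate speech based on the following description.\n",
        "text": text,
        "use_spk_emb": True,
        "prompt_wav_path": prompt_wav_path,
        "prompt_text": prompt_text,
        "max_decode_steps": 200,
        "cfg": 2.0,
        "sigma": 0.25,
        "temperature": 0.0,
    }

def stream_to_file(payload: dict, out_path: str) -> None:
    # POST the JSON body and write the streamed WAV bytes chunk by chunk,
    # instead of buffering the whole response in memory.
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp, open(out_path, "wb") as f:
        while chunk := resp.read(8192):
            f.write(chunk)
```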
Speech generation (save to a path and return metadata):
```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "speech",
    "response_mode": "path",
    "output_wav_path": "output/service/demo.wav",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "此次业绩下滑原因,可归结为企业停止服务某些品牌,而带来的负面影响。",
    "use_spk_emb": true,
    "prompt_wav_path": "data/wavs/00000309-00000300.wav"
  }'
```
0.1.1
The repository now supports URL input for `prompt_wav_path`: the reference audio for voice cloning can be downloaded directly from an HTTP/HTTPS address and then handed to the inference pipeline.
The current implementation stages remote reference audio in a shared cache directory, keyed by a hash of the full download URL; a hit within the TTL (30 minutes by default) is reused directly, which suits multi-process inference services on the same machine.

- Cache directory: `Ming-omni-tts/runtime-cache/prompt-audio/` by default; override with `PROMPT_AUDIO_CACHE_DIR`
- Cache TTL: 1800 seconds by default; override with `PROMPT_AUDIO_CACHE_TTL_SECONDS`
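A minimal sketch of this caching scheme (assumptions: SHA-256 is used to hash the full URL, and the function names here are illustrative, not the repo's actual API):

```python
import hashlib
import os
import time
import urllib.request

# Defaults mirror the description above; override via the two env variables.
CACHE_DIR = os.environ.get("PROMPT_AUDIO_CACHE_DIR", "runtime-cache/prompt-audio")
CACHE_TTL = int(os.environ.get("PROMPT_AUDIO_CACHE_TTL_SECONDS", "1800"))

def cache_path_for(url: str) -> str:
    # Key the cache entry on a hash of the full download URL.
    digest = hashlib.sha256(url.encode("utf-8")).hexdigest()
    return os.path.join(CACHE_DIR, f"{digest}.wav")

def is_fresh(path: str, ttl: int = CACHE_TTL) -> bool:
    # A cached file counts as a hit if it was written within the TTL window.
    return os.path.exists(path) and (time.time() - os.path.getmtime(path)) < ttl

def fetch_prompt_audio(url: str) -> str:
    """Return a local path for the reference audio, downloading on a miss."""
    path = cache_path_for(url)
    if is_fresh(path):
        return path  # cache hit: reuse the shared copy
    os.makedirs(CACHE_DIR, exist_ok=True)
    urllib.request.urlretrieve(url, path)  # download on miss or expiry
    return path
```

Because the key is derived only from the URL string, concurrent worker processes on the same machine resolve the same URL to the same cached file.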