Original Project Introduction

Ming-omni-tts: A Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 🎮 Gradio Demo


Introduction

Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5 Hz continuous tokenizer and patch-by-patch compression, it delivers competitive inference efficiency (3.1 Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.

🚀 Core Capabilities

  • 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control is as high as 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the Instruct-TTS-Eval-zh benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry’s first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • ⚡ High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1 Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
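
The 3.1 Hz inference rate follows directly from the tokenizer rate and the patch size given in the Key Features section; a minimal sketch of the arithmetic:

```python
# How patch-by-patch compression lowers the LLM frame rate:
# the tokenizer runs at 12.5 Hz, and the LLM emits one step per
# patch of 4 frames (patch size from the Key Features section).
TOKENIZER_HZ = 12.5
PATCH_SIZE = 4

llm_hz = TOKENIZER_HZ / PATCH_SIZE
print(f"{llm_hz} Hz")  # 3.125 Hz, quoted as ~3.1 Hz
```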

Demo

https://github.com/user-attachments/assets/eb0e900e-ed5e-40ca-98df-31c244939527

Updates

🚀 Key Features

Compared to other audio LLMs, Ming-omni-tts features the following key optimizations:

  • Unified Continuous Audio Tokenizer: We propose a continuous VAE-based tokenizer that integrates speech, music, and general audio into a unified latent space with 12.5 Hz frame rate, yielding competitive results across audio reconstruction and various downstream synthesis benchmarks.

  • Unified Audio Language Model for Speech, Music and Sound Generation: We present a unified, end-to-end audio language model that employs a single LLM backbone to perform joint generation of speech, music, and general sound. To enhance audio quality, the architecture is augmented with a Diffusion Head. Furthermore, we employ a patch-based generation strategy with a patch size of 4 and a look-back history of 32, enabling an optimal balance between local acoustic detail and long-range structural coherence.
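
As a rough illustration of this patch-based loop (a hypothetical sketch, not the project's implementation; `predict_patch` stands in for the LLM backbone and Diffusion Head, and returns placeholder values rather than real latents):

```python
import random

PATCH_SIZE = 4   # latent frames generated per LLM step
LOOK_BACK = 32   # most recent frames visible as context

def predict_patch(context):
    # Stand-in for the LLM + diffusion head; the real model predicts
    # continuous latents, this just returns PATCH_SIZE placeholder frames.
    return [random.random() for _ in range(PATCH_SIZE)]

def generate(num_frames):
    frames = []
    while len(frames) < num_frames:
        context = frames[-LOOK_BACK:]          # bounded look-back history
        frames.extend(predict_patch(context))  # emit one patch per step
    return frames[:num_frames]

audio_latents = generate(100)  # 100 frames ≈ 8 s at 12.5 Hz
```

The bounded `LOOK_BACK` window is what trades long-range conditioning cost for local acoustic detail: each step sees at most 32 frames of history, regardless of how long the output grows.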

Evaluation

  • Reconstruction: The 12.5 Hz tokenizer supports high-quality reconstruction across speech, music, and sound. Its performance is comparable to existing state-of-the-art methods across key fidelity metrics.
  • Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
  • Emotional Expressiveness: Delivers an average accuracy of 76.7% on CV3-Eval emotional sets and 46.7% on neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) to reach SOTA levels.
  • Instruction-based Voice Design: Scores 76.20% on InstructTTS-Eval-ZH. Its instruction-following capability is on par with Qwen3-TTS-VoiceDesign.
  • Zero-shot Voice Clone: Exhibits exceptional stability on Seed-tts-eval (Chinese) with a WER of 0.83%, outperforming SeedTTS and GLM-TTS.
  • Text Normalization (TN): On internal technical testsets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro.

Audio Tokenizer

Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English).
Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz).
Audio metrics are evaluated on AudioCaps.

Speech Controllable Generative Tasks

Zero-shot TTS

Zero-shot speech generation performance comparison on the Seed-TTS testset.
| Model | Institution | seed-tts-eval-zh WER ↓ | SIM ↑ | seed-tts-eval-en WER ↓ | SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | Bytedance Speech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | College | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | College | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | – | 1.39 | – |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | – | 1.64 | – |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | – | 1.49 | – |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | – | 1.32 | – |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | – | 1.24 | – |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |

Speech Attribute Control

Instruction success rate is reported per controlled attribute (speech rate, volume, F0) together with the average, WER, and speaker similarity:

| Model | Institution | Rate | Volume | F0 | Avg. | WER ↓ | SIM ↑ |
|---|---|---|---|---|---|---|---|
| CosyVoice3 | Alibaba | 100% | 97.67% | 65.33% | 87.67% | 1.21% | 0.58 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 97.67% | 95.00% | 91.33% | 94.67% | 0.27% | 0.712 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 96.33% | 97.00% | 83.67% | 92.33% | 0.347% | 0.776 |

Emotional Control

Below is a comparison between Ming-omni-tts and other state-of-the-art (SOTA) models on the emotion control task.

Emotion accuracy on the Text-Related (TR) and Text-Unrelated (TU) subsets of the CV3-Eval emotional test sets:

| Model | Institution | Average | happy (TR) | sad (TR) | angry (TR) | happy (TU) | sad (TU) | angry (TU) |
|---|---|---|---|---|---|---|---|---|
| F5-TTS | SJTU | 0.647 | 0.92 | 0.52 | 0.72 | 0.80 | 0.28 | 0.64 |
| Sparks-TTS | HKST | 0.553 | 0.80 | 0.56 | 0.50 | 0.50 | 0.60 | 0.36 |
| GPT-SoVits | – | 0.517 | 0.88 | 0.54 | 0.50 | 0.48 | 0.40 | 0.30 |
| CosyVoice2 | Alibaba | 0.587 | 0.84 | 0.72 | 0.58 | 0.56 | 0.44 | 0.38 |
| CosyVoice3-0.5B | Alibaba | 0.663 | 0.92 | 0.70 | 0.72 | 0.64 | 0.42 | 0.58 |
| CosyVoice3-1.5B | Alibaba | 0.630 | 0.86 | 0.64 | 0.72 | 0.64 | 0.44 | 0.48 |
| + DiffRO-EMO | Alibaba | 0.777 | 0.98 | 0.68 | 0.84 | 0.98 | 0.50 | 0.68 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.700 | 0.94 | 0.80 | 0.84 | 0.58 | 0.42 | 0.62 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.767 | 0.96 | 0.86 | 0.90 | 0.66 | 0.40 | 0.82 |
Emotion accuracy on the Text-Related (TR) and Text-Unrelated (TU) subsets of the CV3-Eval neutral test sets:

| Model | Institution | Average | happy (TR) | sad (TR) | angry (TR) | happy (TU) | sad (TU) | angry (TU) |
|---|---|---|---|---|---|---|---|---|
| CosyVoice3-0.5B | Alibaba | 0.400 | 0.68 | 0.30 | 0.78 | 0.14 | 0.04 | 0.46 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.343 | 0.68 | 0.26 | 0.74 | 0.14 | 0.00 | 0.24 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.450 | 0.78 | 0.38 | 0.76 | 0.30 | 0.02 | 0.46 |

Dialect Control

Dialect performance comparison
Columns per test set are CER (%) ↓ / SIM ↑ / ACC (%) ↑ on WSC-Eval-TTS-easy, WSC-Eval-TTS-hard, WSYue-TTS-eval-Base, and WSYue-TTS-eval-Coverage (– marks unreported values):

| Model | Institution | WSC-easy CER | SIM | ACC | WSC-hard CER | SIM | ACC | WSYue-Base CER | SIM | ACC | WSYue-Cov. CER | SIM | ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Step-Audio-TTS | Step | 10.83 | 67.66 | – | 12.52 | 54.52 | – | 27.79 | 0.762 | – | 24.25 | 0.781 | – |
| CosyVoice 2.0 | Alibaba | 7.14 | 70.27 | – | 9.06 | 60.10 | – | 14.38 | 0.812 | – | 13.74 | 0.826 | – |
| Qwen-TTS | Alibaba | 4.13 | – | – | 7.35 | – | – | – | – | – | – | – | – |
| CosyVoice2-WSC | Alibaba | 4.28 | 72.78 | – | 8.78 | 62.59 | – | – | – | – | – | – | – |
| CosyVoice2-WSC-SFT | Alibaba | 4.08 | 78.84 | – | 7.22 | 67.96 | – | – | – | – | – | – | – |
| Llasa-1B | – | – | – | – | – | – | – | 53.31 | 0.732 | – | 43.68 | 0.754 | – |
| Llasa-1B-Yue | – | – | – | – | – | – | – | 10.89 | 0.762 | – | 12.78 | 0.772 | – |
| Edge-TTS | – | – | – | – | – | – | – | 8.30 | – | – | 9.27 | – | – |
| Cosyvoice2-Yue | – | – | – | – | – | – | – | 10.33 | 0.821 | – | 9.49 | 0.834 | – |
| CosyVoice3 | Alibaba | 3.17 | 0.696 | 68.06 | 4.07 | 0.723 | 80.90 | 8.36 | 0.611 | 91.70 | 8.95 | 0.658 | 95.80 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.25 | 0.695 | 82.08 | 3.18 | 0.717 | 84.42 | 9.70 | 0.598 | 96.00 | 11.62 | 0.644 | 95.80 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 2.35 | 0.730 | 83.48 | 3.19 | 0.750 | 88.44 | 6.47 | 0.622 | 96.30 | 7.87 | 0.667 | 95.81 |

Podcast TTS

Podcast performance comparison on the ZipVoice-Dia-zh test set
| Model | Institution | CER ↓ | cpSIM ↑ | UTMOS ↑ |
|---|---|---|---|---|
| ZipVoice-Dia | Xiaomi | 3.39% | 0.553 | 2.24 |
| MoonCast | Kimi | 27.43% | 0.441 | 1.76 |
| MOSS-TTSD | Fudan | 8.62% | 0.421 | 1.70 |
| Vibevoice-1.5B | Microsoft | 12.87% | 0.455 | 1.74 |
| FireRedTTS2 | Xiaohongshu | 3.34% | 0.512 | 1.90 |
| SoulX-Podcast | Soul | 2.20% | 0.599 | 2.09 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.12% | 0.457 | 2.25 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 1.84% | 0.470 | 2.19 |

Voice Design

Voice Design performance comparison on the InstructTTSEval-ZH test set
| Model | Institution | APS ↑ | DSD ↑ | RP ↑ | Average |
|---|---|---|---|---|---|
| Qwen3TTS-12Hz-1.7B-VD | Alibaba | 85.2 | 81.1 | 65.1 | 77.13 |
| Mimo-Audio-7B-Instruct | Xiaomi | 75.7 | 74.3 | 61.5 | 70.50 |
| VoiceSculptor | NPU | 75.7 | 64.7 | 61.5 | 67.30 |
| VoxInstruct | Tsinghua | 47.5 | 52.3 | 42.6 | 47.47 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 83.85 | 75.10 | 61.50 | 73.48 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 87.30 | 79.80 | 61.50 | 76.20 |

Audio & BGM Generation

Text-To-BGM

Text-to-BGM performance comparison on the Ming-BGM-Eval test set
Metrics: mulan_t, the four Audiobox-Aesthetics axes (CE/CU/PC/PQ) with their average, and the five SongEval dimensions (CO/MU/ME/CL/NA) with their average:

| Model | Institution | mulan_t | CE | CU | PC | PQ | Avg. | CO | MU | ME | CL | NA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doubao | Bytedance | 0.268 | 7.55 | 8.21 | 4.97 | 8.25 | 7.24 | 3.30 | 3.02 | 3.00 | 3.02 | 2.92 | 3.05 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.230 | 7.18 | 8.16 | 4.80 | 8.20 | 7.08 | 3.11 | 2.86 | 2.86 | 2.81 | 2.73 | 2.87 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.250 | 7.19 | 8.14 | 4.69 | 8.18 | 7.05 | 3.08 | 2.84 | 2.82 | 2.78 | 2.74 | 2.85 |

Text-To-Audio (TTA)

TTA performance comparison on the AudioCaps test set

| Model | Institution | FD_openl3 ↓ | KL_passt ↓ | CLAP score ↑ |
|---|---|---|---|---|
| AudioLDM-large | University of Surrey | 108.300 | 1.810 | 0.419 |
| Stable Audio Open | Stability AI | 96.133 | 2.148 | 0.306 |
| TangoFlux | Singapore University of Technology and Design | 137.700 | 1.041 | 0.547 |
| TangoFlux_base | Singapore University of Technology and Design | 149.270 | 1.125 | 0.523 |
| Ming-omni-tta-0.5B (ours) | Ant Group | 53.384 | 1.172 | 0.504 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 74.292 | 2.257 | 0.347 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 65.918 | 1.640 | 0.424 |

Text Normalization

Text Normalization performance comparison on the internally constructed test set
| Model | Institution | TN-Area WER ↓ | non-TN-Area WER ↓ |
|---|---|---|---|
| Gemini-2.5 Pro | Google | 2.00% | 0.97% |
| Ming-omni-tts-0.5B (ours) | Ant Group | 1.97% | 0.85% |

Model & Benchmark Downloads

You can download our latest models and benchmarks from both Hugging Face and ModelScope.

| Model | Download |
|---|---|
| Ming-omni-tts-tokenizer-12Hz | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-0.5B | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-16.8B-A3B | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tta-0.5B | 🤗 HuggingFace · 🤖 ModelScope |

If you're in mainland China, we strongly recommend downloading our models from 🤖 ModelScope.

```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-omni-tts-0.5B --local_dir inclusionAI/Ming-omni-tts-0.5B --revision master
```

Note: This download process will take several minutes to several hours, depending on your network conditions.

Environment Preparation

Installation with pip

```shell
pip install -r requirements.txt
```

Installation with docker

You can set up the environment using Docker in two ways.

  • Option 1: Pull from Docker Hub (Recommended)
```shell
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1
# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
```
  • Option 2: Build from Source
```shell
# 1. Build the image (Docker repository names must be lowercase)
docker build -t ming-omni-tts:v1.1 -f ./docker/ming_uniaudio.dockerfile .
# 2. Run the container
docker run -it --gpus all ming-omni-tts:v1.1 /bin/bash
```

Example Usage

Audio Reconstruction

```shell
git clone https://github.com/inclusionAI/MingTok-Audio.git
cd MingTok-Audio
python3 test.py
```

Audio Generation

```shell
git clone https://github.com/inclusionAI/Ming-omni-tts.git
cd Ming-omni-tts
python3 cookbooks/test.py
```

Environment variables:

  • MODEL_PATH (default: inclusionAI/Ming-omni-tts-0.5B)
  • DEVICE (default: cuda:0)
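
A minimal sketch of how those variables would be resolved in a launcher script (fallbacks taken from the defaults listed above):

```python
import os

# Resolve runtime configuration, falling back to the documented defaults.
MODEL_PATH = os.environ.get("MODEL_PATH", "inclusionAI/Ming-omni-tts-0.5B")
DEVICE = os.environ.get("DEVICE", "cuda:0")

print(f"loading {MODEL_PATH} on {DEVICE}")
```

For example, `MODEL_PATH=inclusionAI/Ming-omni-tts-16.8B-A3B python3 cookbooks/test.py` switches models without editing code.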

For detailed usage, please refer to demo.ipynb.

Note: The examples were tested on NVIDIA H800-80GB and H20-96G GPUs with CUDA 12.4.

Citation

If you find our work helpful, please consider citing it.

About This Fork

This project is forked from InclusionAI's Ming-omni-tts and builds the following on top of its model capabilities:
yscgreedy<20260303>:

  • Added a working FastAPI backend for the model: app.py lives in the service directory, and the MingAudio class from cookbooks/test.py was extracted into ming_audio.py for the backend to call. The exposed endpoints are /healthz, /v1/generate, /v1/stream, and /seed; see test.ipynb for invocation details.
  • The underlying Ming-omni-tts model supports streaming output, but MingAudio did not expose a corresponding method, so one was added during the migration and surfaced through the FastAPI interface.
    • A simple RTF test (in test.ipynb): on Windows 11 with an RTX 4060 Laptop GPU, the streaming RTF is about 0.8.
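
For context, the real-time factor (RTF) quoted above is synthesis wall-clock time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time synthesis:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio."""
    return synthesis_seconds / audio_seconds

# e.g. producing 10 s of audio in 8 s of wall-clock time:
print(real_time_factor(8.0, 10.0))  # 0.8, i.e. faster than real time
```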

Backend Service (FastAPI)

```shell
git clone https://github.com/Yscgreedy/Ming-omni-tts.git
pip install -r requirements.txt
uvicorn service.app:app --host 0.0.0.0 --port 8000
# or: python ./service/app.py
```

Endpoints:

  • GET /healthz
  • POST /seed
  • POST /v1/stream
  • POST /v1/generate
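
The same endpoints can be called from Python; a minimal client sketch using only the standard library (field names are taken from the curl examples below, while the helper name is ours):

```python
import json
import urllib.request

def build_generate_request(base_url, text, task_type="speech", **options):
    """Build a POST request for the /v1/generate endpoint."""
    payload = {"task_type": task_type, "text": text, **options}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request(
    "http://127.0.0.1:8000",
    "我们的愿景是构建未来服务业的数字化基础设施。",
    max_decode_steps=200,
)
# urllib.request.urlopen(req) performs the call once the service is running.
```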

Text generation request:

```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "text",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "化学反应方程式:\\ce{2H2 + O2 -> 2H2O}",
    "max_decode_steps": 200
  }'
```

Speech generation (stream wav):

```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "speech",
    "response_mode": "stream",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。",
    "use_spk_emb": true,
    "prompt_wav_path": "data/wavs/10002287-00000094.wav",
    "prompt_text": "在此奉劝大家别乱打美白针。",
    "max_decode_steps": 200,
    "cfg": 2.0,
    "sigma": 0.25,
    "temperature": 0.0
  }' --output output/generated.wav
```

Speech generation (save path and return metadata):

```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "speech",
    "response_mode": "path",
    "output_wav_path": "output/service/demo.wav",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "此次业绩下滑原因,可归结为企业停止服务某些品牌,而带来的负面影响。",
    "use_spk_emb": true,
    "prompt_wav_path": "data/wavs/00000309-00000300.wav"
  }'
```

Current Collaboration Version

0.1.1

This repository now supports URL input for prompt_wav_path: the voice-cloning reference audio can be downloaded directly from an HTTP/HTTPS address and then handed to the inference pipeline. The current implementation caches remote reference audio in a shared cache directory, keyed by a hash of the full download URL; by default, a hit within 30 minutes is reused directly, which suits multi-process inference services on the same machine.

Reference Audio URL Cache

  • The cache directory defaults to Ming-omni-tts/runtime-cache/prompt-audio/
  • The directory can be overridden via the PROMPT_AUDIO_CACHE_DIR environment variable
  • The cache TTL defaults to 1800 seconds and can be overridden via PROMPT_AUDIO_CACHE_TTL_SECONDS
  • The same full URL always hits the same cache file; a changed version parameter in the URL automatically maps to a new cache file
  • Downloads write to a temporary file followed by an atomic replace, so concurrent processes never read a half-written file
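
Put together, the caching scheme above can be sketched roughly like this (illustrative only; the function and variable names are ours, not the service's actual code):

```python
import hashlib
import os
import tempfile
import time

CACHE_DIR = os.environ.get("PROMPT_AUDIO_CACHE_DIR", "runtime-cache/prompt-audio")
CACHE_TTL = float(os.environ.get("PROMPT_AUDIO_CACHE_TTL_SECONDS", "1800"))

def cached_prompt_audio(url, download):
    """Return a local path for `url`, reusing a fresh cache entry if present.

    `download` is a callable url -> bytes. The cache key is a hash of the
    full URL, so a changed version query parameter yields a new file.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_TTL:
        return path  # fresh cache hit, reuse without downloading
    data = download(url)
    # Write to a temp file, then atomically replace, so concurrent
    # processes never observe a partially written file.
    fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
    return path
```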
