Original Project Introduction

Ming-omni-tts: A Simple and Efficient Unified Generation of Speech, Music, and Sound with Precise Control

🌐 Project Page | 🤗 Hugging Face | 🤖 ModelScope | 🎮 Gradio Demo


Introduction

Ming-omni-tts is a high-performance unified audio generation model that achieves precise control over speech attributes and enables single-channel synthesis of speech, environmental sounds, and music. Powered by a custom 12.5 Hz continuous tokenizer and patch-by-patch compression, it delivers competitive inference efficiency (3.1 Hz). Additionally, the model features robust text normalization capabilities for the accurate and natural narration of complex mathematical and chemical expressions.

🚀 Core Capabilities

  • 🔊 Fine-grained Vocal Control: The model supports precise control over speech rate, pitch, volume, emotion, and dialect through simple commands. Notably, its accuracy for Cantonese dialect control is as high as 93%, and its emotion control accuracy reaches 46.7%, surpassing CosyVoice3.
  • 🌌 Intelligent Voice Design: Features 100+ premium built-in voices and supports zero-shot voice design through natural language descriptions. Its performance on the Instruct-TTS-Eval-zh benchmark is on par with Qwen3-TTS.
  • 🎶 Immersive Unified Generation: The industry’s first autoregressive model to jointly generate speech, ambient sound, and music in a single channel. Built on a custom 12.5Hz continuous tokenizer and a DiT head architecture, it delivers a seamless, "in-the-scene" auditory experience.
  • ⚡ High-efficiency Inference: Introduces a "Patch-by-Patch" compression strategy that reduces the LLM inference frame rate to 3.1 Hz. This significantly cuts latency and enables podcast-style audio generation while preserving naturalness and audio detail.
  • 🧪 Professional Text Normalization: The model accurately parses and narrates complex formats, including mathematical expressions and chemical equations, ensuring natural-sounding output for specialized applications.
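
The 3.1 Hz inference rate follows directly from the tokenizer rate and the patch size given in the Key Features section; a minimal sketch of the arithmetic:

```python
# How patch-by-patch compression lowers the LLM frame rate:
# the tokenizer runs at 12.5 Hz, and the LLM emits one step per
# patch of 4 frames (patch size from the Key Features section).
TOKENIZER_HZ = 12.5
PATCH_SIZE = 4

llm_hz = TOKENIZER_HZ / PATCH_SIZE
print(f"{llm_hz} Hz")  # 3.125 Hz, quoted as ~3.1 Hz
```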

Demo

https://github.com/user-attachments/assets/eb0e900e-ed5e-40ca-98df-31c244939527

Updates

🚀 Key Features

Compared to other audio LLMs, Ming-omni-tts features the following key optimizations:

  • Unified Continuous Audio Tokenizer: We propose a continuous VAE-based tokenizer that integrates speech, music, and general audio into a unified latent space with 12.5 Hz frame rate, yielding competitive results across audio reconstruction and various downstream synthesis benchmarks.

  • Unified Audio Language Model for Speech, Music and Sound Generation: We present a unified, end-to-end audio language model that employs a single LLM backbone to perform joint generation of speech, music, and general sound. To enhance audio quality, the architecture is augmented with a Diffusion Head. Furthermore, we employ a patch-based generation strategy with a patch size of 4 and a look-back history of 32, enabling an optimal balance between local acoustic detail and long-range structural coherence.
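
As a rough illustration of this patch-based loop (a hypothetical sketch, not the project's implementation; `predict_patch` stands in for the LLM backbone and Diffusion Head, and returns placeholder values rather than real latents):

```python
import random

PATCH_SIZE = 4   # latent frames generated per LLM step
LOOK_BACK = 32   # most recent frames visible as context

def predict_patch(context):
    # Stand-in for the LLM + diffusion head; the real model predicts
    # continuous latents, this just returns PATCH_SIZE placeholder frames.
    return [random.random() for _ in range(PATCH_SIZE)]

def generate(num_frames):
    frames = []
    while len(frames) < num_frames:
        context = frames[-LOOK_BACK:]          # bounded look-back history
        frames.extend(predict_patch(context))  # emit one patch per step
    return frames[:num_frames]

audio_latents = generate(100)  # 100 frames ≈ 8 s at 12.5 Hz
```

The bounded `LOOK_BACK` window is what trades long-range conditioning cost for local acoustic detail: each step sees at most 32 frames of history, regardless of how long the output grows.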

Evaluation

  • Reconstruction: The 12.5 Hz tokenizer supports high-quality reconstruction across speech, music, and sound. Its performance is comparable to existing state-of-the-art methods across key fidelity metrics.
  • Dialect Generation: Achieves 96% accuracy on WSYue-TTS-Eval and 86% on WSC-TTS-Eval, outperforming CosyVoice3.
  • Emotional Expressiveness: Delivers an average accuracy of 76.7% on CV3-Eval emotional sets and 46.7% on neutral emotion sets, significantly surpassing CosyVoice3-Base (40%) to reach SOTA levels.
  • Instruction-based Voice Design: Scores 76.20% on InstructTTS-Eval-ZH. Its instruction-following capability is on par with Qwen3-TTS-VoiceDesign.
  • Zero-shot Voice Clone: Exhibits exceptional stability on Seed-tts-eval (Chinese) with a WER of 0.83%, outperforming SeedTTS and GLM-TTS.
  • Text Normalization (TN): On internal technical testsets, the model achieves a CER of 1.97% in normalized regions, delivering performance comparable to Gemini-2.5 Pro.

Audio Tokenizer

Speech metrics are evaluated on AISHELL-3 (44.1 kHz, Chinese) and VCTK (44.1 kHz, English).
Music metrics are evaluated on MUSDB18 (44.1 kHz) and MUSDB18-HQ (44.1 kHz).
Audio metrics are evaluated on AudioCaps.

Speech Controllable Generative Tasks

Zero-shot TTS

Zero-shot speech generation performance comparison on the Seed-TTS testset.
| Model | Institution | seed-tts-eval-zh WER ↓ | SIM ↑ | seed-tts-eval-en WER ↓ | SIM ↑ |
|---|---|---|---|---|---|
| Seed-TTS | Bytedance Speech | 1.11 | 0.796 | 2.24 | 0.762 |
| MaskGCT | College | 2.27 | 0.774 | 2.62 | 0.714 |
| E2 TTS | Microsoft | 1.97 | 0.730 | 2.19 | 0.710 |
| F5-TTS | College | 1.56 | 0.741 | 1.83 | 0.647 |
| CosyVoice 2 | Alibaba | 1.45 | 0.748 | 2.57 | 0.652 |
| Qwen3-Omni-30B-A3B | Alibaba | 1.07 | – | 1.39 | – |
| CosyVoice 3-0.5B | Alibaba | 1.16 | 0.780 | 2.02 | 0.718 |
| CosyVoice 3-1.5B | Alibaba | 0.71 | 0.775 | 1.45 | 0.695 |
| Qwen3-TTS-25Hz-0.6B-Base | Alibaba | 1.18 | – | 1.64 | – |
| Qwen3-TTS-25Hz-1.7B-Base | Alibaba | 1.10 | – | 1.49 | – |
| Qwen3-TTS-12Hz-0.6B-Base | Alibaba | 0.92 | – | 1.32 | – |
| Qwen3-TTS-12Hz-1.7B-Base | Alibaba | 0.77 | – | 1.24 | – |
| GLM-TTS | Zhipu AI | 1.03 | 0.761 | 2.23 | 0.672 |
| Ming-Flash-Omni-preview | Ant Group | 0.99 | 0.740 | 1.59 | 0.680 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.87 | 0.72 | 2.19 | 0.61 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.83 | 0.75 | 2.02 | 0.62 |

Speech Attribute Control

Instruction success rate is reported per controlled attribute (speech rate, volume, F0) together with the average, WER, and speaker similarity:

| Model | Institution | Rate | Volume | F0 | Avg. | WER ↓ | SIM ↑ |
|---|---|---|---|---|---|---|---|
| CosyVoice3 | Alibaba | 100% | 97.67% | 65.33% | 87.67% | 1.21% | 0.58 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 97.67% | 95.00% | 91.33% | 94.67% | 0.27% | 0.712 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 96.33% | 97.00% | 83.67% | 92.33% | 0.347% | 0.776 |

Emotional Control

Below is a comparison between Ming-omni-tts and other state-of-the-art (SOTA) models on the emotion control task.

Emotion accuracy on the Text-Related (TR) and Text-Unrelated (TU) subsets of the CV3-Eval emotional test sets:

| Model | Institution | Average | happy (TR) | sad (TR) | angry (TR) | happy (TU) | sad (TU) | angry (TU) |
|---|---|---|---|---|---|---|---|---|
| F5-TTS | SJTU | 0.647 | 0.92 | 0.52 | 0.72 | 0.80 | 0.28 | 0.64 |
| Sparks-TTS | HKST | 0.553 | 0.80 | 0.56 | 0.50 | 0.50 | 0.60 | 0.36 |
| GPT-SoVits | – | 0.517 | 0.88 | 0.54 | 0.50 | 0.48 | 0.40 | 0.30 |
| CosyVoice2 | Alibaba | 0.587 | 0.84 | 0.72 | 0.58 | 0.56 | 0.44 | 0.38 |
| CosyVoice3-0.5B | Alibaba | 0.663 | 0.92 | 0.70 | 0.72 | 0.64 | 0.42 | 0.58 |
| CosyVoice3-1.5B | Alibaba | 0.630 | 0.86 | 0.64 | 0.72 | 0.64 | 0.44 | 0.48 |
| + DiffRO-EMO | Alibaba | 0.777 | 0.98 | 0.68 | 0.84 | 0.98 | 0.50 | 0.68 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.700 | 0.94 | 0.80 | 0.84 | 0.58 | 0.42 | 0.62 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.767 | 0.96 | 0.86 | 0.90 | 0.66 | 0.40 | 0.82 |
Emotion accuracy on the Text-Related (TR) and Text-Unrelated (TU) subsets of the CV3-Eval neutral test sets:

| Model | Institution | Average | happy (TR) | sad (TR) | angry (TR) | happy (TU) | sad (TU) | angry (TU) |
|---|---|---|---|---|---|---|---|---|
| CosyVoice3-0.5B | Alibaba | 0.400 | 0.68 | 0.30 | 0.78 | 0.14 | 0.04 | 0.46 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.343 | 0.68 | 0.26 | 0.74 | 0.14 | 0.00 | 0.24 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.450 | 0.78 | 0.38 | 0.76 | 0.30 | 0.02 | 0.46 |

Dialect Control

Dialect performance comparison
Columns per test set are CER (%) ↓ / SIM ↑ / ACC (%) ↑ on WSC-Eval-TTS-easy, WSC-Eval-TTS-hard, WSYue-TTS-eval-Base, and WSYue-TTS-eval-Coverage (– marks unreported values):

| Model | Institution | WSC-easy CER | SIM | ACC | WSC-hard CER | SIM | ACC | WSYue-Base CER | SIM | ACC | WSYue-Cov. CER | SIM | ACC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Step-Audio-TTS | Step | 10.83 | 67.66 | – | 12.52 | 54.52 | – | 27.79 | 0.762 | – | 24.25 | 0.781 | – |
| CosyVoice 2.0 | Alibaba | 7.14 | 70.27 | – | 9.06 | 60.10 | – | 14.38 | 0.812 | – | 13.74 | 0.826 | – |
| Qwen-TTS | Alibaba | 4.13 | – | – | 7.35 | – | – | – | – | – | – | – | – |
| CosyVoice2-WSC | Alibaba | 4.28 | 72.78 | – | 8.78 | 62.59 | – | – | – | – | – | – | – |
| CosyVoice2-WSC-SFT | Alibaba | 4.08 | 78.84 | – | 7.22 | 67.96 | – | – | – | – | – | – | – |
| Llasa-1B | – | – | – | – | – | – | – | 53.31 | 0.732 | – | 43.68 | 0.754 | – |
| Llasa-1B-Yue | – | – | – | – | – | – | – | 10.89 | 0.762 | – | 12.78 | 0.772 | – |
| Edge-TTS | – | – | – | – | – | – | – | 8.30 | – | – | 9.27 | – | – |
| Cosyvoice2-Yue | – | – | – | – | – | – | – | 10.33 | 0.821 | – | 9.49 | 0.834 | – |
| CosyVoice3 | Alibaba | 3.17 | 0.696 | 68.06 | 4.07 | 0.723 | 80.90 | 8.36 | 0.611 | 91.70 | 8.95 | 0.658 | 95.80 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.25 | 0.695 | 82.08 | 3.18 | 0.717 | 84.42 | 9.70 | 0.598 | 96.00 | 11.62 | 0.644 | 95.80 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 2.35 | 0.730 | 83.48 | 3.19 | 0.750 | 88.44 | 6.47 | 0.622 | 96.30 | 7.87 | 0.667 | 95.81 |

Podcast TTS

Podcast performance comparison on the ZipVoice-Dia-zh test set
| Model | Institution | CER ↓ | cpSIM ↑ | UTMOS ↑ |
|---|---|---|---|---|
| ZipVoice-Dia | Xiaomi | 3.39% | 0.553 | 2.24 |
| MoonCast | Kimi | 27.43% | 0.441 | 1.76 |
| MOSS-TTSD | Fudan | 8.62% | 0.421 | 1.70 |
| Vibevoice-1.5B | Microsoft | 12.87% | 0.455 | 1.74 |
| FireRedTTS2 | Xiaohongshu | 3.34% | 0.512 | 1.90 |
| SoulX-Podcast | Soul | 2.20% | 0.599 | 2.09 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 2.12% | 0.457 | 2.25 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 1.84% | 0.470 | 2.19 |

Voice Design

Voice Design performance comparison on the InstructTTSEval-ZH test set
| Model | Institution | APS ↑ | DSD ↑ | RP ↑ | Average |
|---|---|---|---|---|---|
| Qwen3TTS-12Hz-1.7B-VD | Alibaba | 85.2 | 81.1 | 65.1 | 77.13 |
| Mimo-Audio-7B-Instruct | Xiaomi | 75.7 | 74.3 | 61.5 | 70.50 |
| VoiceSculptor | NPU | 75.7 | 64.7 | 61.5 | 67.30 |
| VoxInstruct | Tsinghua | 47.5 | 52.3 | 42.6 | 47.47 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 83.85 | 75.10 | 61.50 | 73.48 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 87.30 | 79.80 | 61.50 | 76.20 |

Audio & BGM Generation

Text-To-BGM

Text-to-BGM performance comparison on the Ming-BGM-Eval test set
Metrics: mulan_t, the four Audiobox-Aesthetics axes (CE/CU/PC/PQ) with their average, and the five SongEval dimensions (CO/MU/ME/CL/NA) with their average:

| Model | Institution | mulan_t | CE | CU | PC | PQ | Avg. | CO | MU | ME | CL | NA | Avg. |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Doubao | Bytedance | 0.268 | 7.55 | 8.21 | 4.97 | 8.25 | 7.24 | 3.30 | 3.02 | 3.00 | 3.02 | 2.92 | 3.05 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 0.230 | 7.18 | 8.16 | 4.80 | 8.20 | 7.08 | 3.11 | 2.86 | 2.86 | 2.81 | 2.73 | 2.87 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 0.250 | 7.19 | 8.14 | 4.69 | 8.18 | 7.05 | 3.08 | 2.84 | 2.82 | 2.78 | 2.74 | 2.85 |

Text-To-Audio (TTA)

TTA performance comparison on the AudioCaps test set

| Model | Institution | FD_openl3 ↓ | KL_passt ↓ | CLAP score ↑ |
|---|---|---|---|---|
| AudioLDM-large | University of Surrey | 108.300 | 1.810 | 0.419 |
| Stable Audio Open | Stability AI | 96.133 | 2.148 | 0.306 |
| TangoFlux | Singapore University of Technology and Design | 137.700 | 1.041 | 0.547 |
| TangoFlux_base | Singapore University of Technology and Design | 149.270 | 1.125 | 0.523 |
| Ming-omni-tta-0.5B (ours) | Ant Group | 53.384 | 1.172 | 0.504 |
| Ming-omni-tts-0.5B (ours) | Ant Group | 74.292 | 2.257 | 0.347 |
| Ming-omni-tts-16.8B-A3B (ours) | Ant Group | 65.918 | 1.640 | 0.424 |

Text Normalization

Text Normalization performance comparison on the internally constructed test set
| Model | Institution | TN-Area WER ↓ | non-TN-Area WER ↓ |
|---|---|---|---|
| Gemini-2.5 Pro | Google | 2.00% | 0.97% |
| Ming-omni-tts-0.5B (ours) | Ant Group | 1.97% | 0.85% |

Model & Benchmark Downloads

You can download our latest models and benchmarks from both Hugging Face and ModelScope.

| Model | Download |
|---|---|
| Ming-omni-tts-tokenizer-12Hz | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-0.5B | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tts-16.8B-A3B | 🤗 HuggingFace · 🤖 ModelScope |
| Ming-omni-tta-0.5B | 🤗 HuggingFace · 🤖 ModelScope |

If you're in mainland China, we strongly recommend downloading our models from 🤖 ModelScope.

```shell
pip install modelscope
modelscope download --model inclusionAI/Ming-omni-tts-0.5B --local_dir inclusionAI/Ming-omni-tts-0.5B --revision master
```

Note: This download process will take several minutes to several hours, depending on your network conditions.

Environment Preparation

Installation with pip

```shell
pip install -r requirements.txt
```

Installation with docker

You can set up the environment using Docker in two ways.

  • Option 1: Pull from Docker Hub (Recommended)
```shell
# 1. Pull the pre-built image
docker pull yongjielv/ming_uniaudio:v1.1
# 2. Run the container
docker run -it --gpus all yongjielv/ming_uniaudio:v1.1 /bin/bash
```
  • Option 2: Build from Source
```shell
# 1. Build the image (Docker repository names must be lowercase)
docker build -t ming-omni-tts:v1.1 -f ./docker/ming_uniaudio.dockerfile .
# 2. Run the container
docker run -it --gpus all ming-omni-tts:v1.1 /bin/bash
```

Example Usage

Audio Reconstruction

```shell
git clone https://github.com/inclusionAI/MingTok-Audio.git
cd MingTok-Audio
python3 test.py
```

Audio Generation

```shell
git clone https://github.com/inclusionAI/Ming-omni-tts.git
cd Ming-omni-tts
python3 cookbooks/test.py
```

Environment variables:

  • MODEL_PATH (default: inclusionAI/Ming-omni-tts-0.5B)
  • DEVICE (default: cuda:0)
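
A minimal sketch of how those variables would be resolved in a launcher script (fallbacks taken from the defaults listed above):

```python
import os

# Resolve runtime configuration, falling back to the documented defaults.
MODEL_PATH = os.environ.get("MODEL_PATH", "inclusionAI/Ming-omni-tts-0.5B")
DEVICE = os.environ.get("DEVICE", "cuda:0")

print(f"loading {MODEL_PATH} on {DEVICE}")
```

For example, `MODEL_PATH=inclusionAI/Ming-omni-tts-16.8B-A3B python3 cookbooks/test.py` switches models without editing code.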

For detailed usage, please refer to demo.ipynb.

Note: The examples were tested on NVIDIA H800-80GB and H20-96G GPUs with CUDA 12.4.

Citation

If you find our work helpful, please consider citing it.

About This Fork

This project is forked from InclusionAI's Ming-omni-tts and builds the following on top of its model capabilities:
yscgreedy<20260303>:

  • Added a working FastAPI backend for the model: app.py lives in the service directory, and the MingAudio class from cookbooks/test.py was extracted into ming_audio.py for the backend to call. The exposed endpoints are /healthz, /v1/generate, /v1/stream, and /seed; see test.ipynb for invocation details.
  • The underlying Ming-omni-tts model supports streaming output, but MingAudio did not expose a corresponding method, so one was added during the migration and surfaced through the FastAPI interface.
    • A simple RTF test (in test.ipynb): on Windows 11 with an RTX 4060 Laptop GPU, the streaming RTF is about 0.8.
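
For context, the real-time factor (RTF) quoted above is synthesis wall-clock time divided by the duration of the audio produced, so values below 1.0 mean faster-than-real-time synthesis:

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = wall-clock synthesis time / duration of generated audio."""
    return synthesis_seconds / audio_seconds

# e.g. producing 10 s of audio in 8 s of wall-clock time:
print(real_time_factor(8.0, 10.0))  # 0.8, i.e. faster than real time
```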

Backend Service (FastAPI)

```shell
git clone https://github.com/Yscgreedy/Ming-omni-tts.git
pip install -r requirements.txt
uvicorn service.app:app --host 0.0.0.0 --port 8000
# or: python ./service/app.py
```

Endpoints:

  • GET /healthz
  • POST /seed
  • POST /v1/stream
  • POST /v1/generate
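
The same endpoints can be called from Python; a minimal client sketch using only the standard library (field names are taken from the curl examples below, while the helper name is ours):

```python
import json
import urllib.request

def build_generate_request(base_url, text, task_type="speech", **options):
    """Build a POST request for the /v1/generate endpoint."""
    payload = {"task_type": task_type, "text": text, **options}
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_generate_request(
    "http://127.0.0.1:8000",
    "我们的愿景是构建未来服务业的数字化基础设施。",
    max_decode_steps=200,
)
# urllib.request.urlopen(req) performs the call once the service is running.
```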

Text generation request:

```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "text",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "化学反应方程式:\\ce{2H2 + O2 -> 2H2O}",
    "max_decode_steps": 200
  }'
```

Speech generation (stream wav):

```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "speech",
    "response_mode": "stream",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "我们的愿景是构建未来服务业的数字化基础设施,为世界带来更多微小而美好的改变。",
    "use_spk_emb": true,
    "prompt_wav_path": "data/wavs/10002287-00000094.wav",
    "prompt_text": "在此奉劝大家别乱打美白针。",
    "max_decode_steps": 200,
    "cfg": 2.0,
    "sigma": 0.25,
    "temperature": 0.0
  }' --output output/generated.wav
```

Speech generation (save path and return metadata):

```shell
curl -X POST http://127.0.0.1:8000/v1/generate \
  -H "Content-Type: application/json" \
  -d '{
    "task_type": "speech",
    "response_mode": "path",
    "output_wav_path": "output/service/demo.wav",
    "prompt": "Please generate speech based on the following description.\n",
    "text": "此次业绩下滑原因,可归结为企业停止服务某些品牌,而带来的负面影响。",
    "use_spk_emb": true,
    "prompt_wav_path": "data/wavs/00000309-00000300.wav"
  }'
```

Current Collaboration Version

0.1.1

This repository now supports URL input for prompt_wav_path: the voice-cloning reference audio can be downloaded directly from an HTTP/HTTPS address and then handed to the inference pipeline. The current implementation caches remote reference audio in a shared cache directory, keyed by a hash of the full download URL; by default, a hit within 30 minutes is reused directly, which suits multi-process inference services on the same machine.

Reference Audio URL Cache

  • The cache directory defaults to Ming-omni-tts/runtime-cache/prompt-audio/
  • The directory can be overridden via the PROMPT_AUDIO_CACHE_DIR environment variable
  • The cache TTL defaults to 1800 seconds and can be overridden via PROMPT_AUDIO_CACHE_TTL_SECONDS
  • The same full URL always hits the same cache file; a changed version parameter in the URL automatically maps to a new cache file
  • Downloads write to a temporary file followed by an atomic replace, so concurrent processes never read a half-written file
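
Put together, the caching scheme above can be sketched roughly like this (illustrative only; the function and variable names are ours, not the service's actual code):

```python
import hashlib
import os
import tempfile
import time

CACHE_DIR = os.environ.get("PROMPT_AUDIO_CACHE_DIR", "runtime-cache/prompt-audio")
CACHE_TTL = float(os.environ.get("PROMPT_AUDIO_CACHE_TTL_SECONDS", "1800"))

def cached_prompt_audio(url, download):
    """Return a local path for `url`, reusing a fresh cache entry if present.

    `download` is a callable url -> bytes. The cache key is a hash of the
    full URL, so a changed version query parameter yields a new file.
    """
    os.makedirs(CACHE_DIR, exist_ok=True)
    key = hashlib.sha256(url.encode("utf-8")).hexdigest()
    path = os.path.join(CACHE_DIR, key)
    if os.path.exists(path) and time.time() - os.path.getmtime(path) < CACHE_TTL:
        return path  # fresh cache hit, reuse without downloading
    data = download(url)
    # Write to a temp file, then atomically replace, so concurrent
    # processes never observe a partially written file.
    fd, tmp = tempfile.mkstemp(dir=CACHE_DIR)
    with os.fdopen(fd, "wb") as f:
        f.write(data)
    os.replace(tmp, path)
    return path
```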
