
LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space

LongCat-AudioDiT

Introduction

LongCat-AudioDiT is a state-of-the-art (SOTA) diffusion-based text-to-speech (TTS) model that operates directly in the waveform latent space.

Abstract: We present LongCat-TTS, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-TTS lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-TTS achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-TTS-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
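The abstract contrasts classifier-free guidance (CFG) with adaptive projection guidance. The paper's exact formulation is not reproduced here; as a rough illustration, one published family of projected-guidance methods splits the guidance vector into components parallel and orthogonal to the conditional prediction and down-weights the parallel part, which is associated with oversaturation at high guidance scales. A minimal NumPy sketch under that assumption (the `eta` weight and the normalization scheme are illustrative, not the released implementation):

```python
import numpy as np

def cfg(cond, uncond, s):
    # Classifier-free guidance: extrapolate along the cond - uncond direction.
    return uncond + s * (cond - uncond)

def apg(cond, uncond, s, eta=0.0, eps=1e-8):
    # Projected guidance (illustrative form, NOT the paper's exact method):
    # split the guidance vector into components parallel and orthogonal to
    # the conditional prediction, then down-weight the parallel part by eta.
    diff = cond - uncond
    unit = cond / (np.linalg.norm(cond) + eps)
    parallel = np.dot(diff.ravel(), unit.ravel()) * unit
    orthogonal = diff - parallel
    return cond + (s - 1.0) * (orthogonal + eta * parallel)
```

With `eta=1.0` the projection is a no-op and the update reduces exactly to CFG, which makes the relationship between the two easy to check numerically.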


This repository provides the HuggingFace-compatible implementation, including model definition, weight conversion, and inference scripts.

Experimental Results on Seed Benchmark

LongCat-AudioDiT obtains state-of-the-art (SOTA) voice cloning performance on the Seed benchmark, surpassing both closed-source and open-source models.

| Model | ZH CER (%) | ZH SIM | EN WER (%) | EN SIM | ZH-Hard CER (%) | ZH-Hard SIM |
|---|---|---|---|---|---|---|
| GT | 1.26 | 0.755 | 2.14 | 0.734 | - | - |
| Seed-DiT | 1.18 | 0.809 | 1.73 | 0.790 | - | - |
| MaskGCT | 2.27 | 0.774 | 2.62 | 0.714 | 10.27 | 0.748 |
| E2 TTS | 1.97 | 0.730 | 2.19 | 0.710 | - | - |
| F5 TTS | 1.56 | 0.741 | 1.83 | 0.647 | 8.67 | 0.713 |
| F5R-TTS | 1.37 | 0.754 | - | - | 8.79 | 0.718 |
| ZipVoice | 1.40 | 0.751 | 1.64 | 0.668 | - | - |
| Seed-ICL | 1.12 | 0.796 | 2.25 | 0.762 | 7.59 | 0.776 |
| SparkTTS | 1.20 | 0.672 | 1.98 | 0.584 | - | - |
| FireRedTTS | 1.51 | 0.635 | 3.82 | 0.460 | 17.45 | 0.621 |
| Qwen2.5-Omni | 1.70 | 0.752 | 2.72 | 0.632 | 7.97 | 0.747 |
| Qwen2.5-Omni_RL | 1.42 | 0.754 | 2.33 | 0.641 | 6.54 | 0.752 |
| CosyVoice | 3.63 | 0.723 | 4.29 | 0.609 | 11.75 | 0.709 |
| CosyVoice2 | 1.45 | 0.748 | 2.57 | 0.652 | 6.83 | 0.724 |
| FireRedTTS-1S | 1.05 | 0.750 | 2.17 | 0.660 | 7.63 | 0.748 |
| CosyVoice3-1.5B | 1.12 | 0.781 | 2.21 | 0.720 | 5.83 | 0.758 |
| IndexTTS2 | 1.03 | 0.765 | 2.23 | 0.706 | 7.12 | 0.755 |
| DiTAR | 1.02 | 0.753 | 1.69 | 0.735 | - | - |
| MiniMax-Speech | 0.99 | 0.799 | 1.90 | 0.738 | - | - |
| VoxCPM | 0.93 | 0.772 | 1.85 | 0.729 | 8.87 | 0.730 |
| MOSS-TTS | 1.20 | 0.788 | 1.85 | 0.734 | - | - |
| Qwen3-TTS | 1.22 | 0.770 | 1.23 | 0.717 | 6.76 | 0.748 |
| CosyVoice3.5 | 0.87 | 0.797 | 1.57 | 0.738 | 5.71 | 0.786 |
| LongCat-AudioDiT-1B | 1.18 | 0.812 | 1.78 | 0.762 | 6.33 | 0.787 |
| LongCat-AudioDiT-3.5B | 1.09 | 0.818 | 1.50 | 0.786 | 6.04 | 0.797 |

Installation

```shell
pip install -r requirements.txt
```

CLI Inference

```shell
# TTS
python inference.py \
    --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" \
    --output_audio output.wav \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B

# Voice cloning
python inference.py \
    --text "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。" \
    --prompt_text "小偷却一点也不气馁,继续在抽屉里翻找。" \
    --prompt_audio assets/prompt.wav \
    --output_audio output.wav \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg

# Batch inference (SeedTTS eval format, one item per line: uid|prompt_text|prompt_wav_path|gen_text)
python batch_inference.py \
    --lst /path/to/meta.lst \
    --output_dir /path/to/output \
    --model_dir meituan-longcat/LongCat-AudioDiT-1B \
    --guidance_method apg
```
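The `meta.lst` format stated above (`uid|prompt_text|prompt_wav_path|gen_text`, one item per line) can be produced programmatically. A minimal sketch; the entries are illustrative placeholders, not files shipped with the repository:

```python
from pathlib import Path

# Each line: uid|prompt_text|prompt_wav_path|gen_text (the format expected
# by batch_inference.py). These entries are placeholders for illustration.
items = [
    ("utt_0001",
     "小偷却一点也不气馁,继续在抽屉里翻找。",
     "assets/prompt.wav",
     "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"),
]

lines = ["|".join(fields) for fields in items]
Path("meta.lst").write_text("\n".join(lines) + "\n", encoding="utf-8")
```

Note that the pipe-delimited format means none of the four fields may themselves contain a `|` character.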

Inference (Python API)

1. TTS

```python
import torch
import soundfile as sf
from transformers import AutoTokenizer

import audiodit  # auto-registers with transformers
from audiodit import AudioDiTModel

# Load model
model = AudioDiTModel.from_pretrained("meituan-longcat/LongCat-AudioDiT-1B").to("cuda")
model.vae.to_half()  # VAE runs in fp16 (matching original)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(model.config.text_encoder_model)

# Zero-shot synthesis
inputs = tokenizer(
    ["今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"],
    padding="longest",
    return_tensors="pt",
)
output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    duration=62,            # latent frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="cfg",  # or "apg"
    seed=1024,
)
sf.write("output.wav", output.waveform.squeeze().cpu().numpy(), 24000)
```

2. Voice Cloning (with prompt audio)

```python
import librosa
import torch

# Load prompt audio
audio, _ = librosa.load("assets/prompt.wav", sr=24000, mono=True)
prompt_wav = torch.from_numpy(audio).unsqueeze(0).unsqueeze(0)  # (1, 1, T)

# Concatenate prompt_text + gen_text for the text encoder
prompt_text = "小偷却一点也不气馁,继续在抽屉里翻找。"
gen_text = "今天晴暖转阴雨,空气质量优至良,空气相对湿度较低。"
inputs = tokenizer([f"{prompt_text} {gen_text}"], padding="longest", return_tensors="pt")
output = model(
    input_ids=inputs.input_ids,
    attention_mask=inputs.attention_mask,
    prompt_audio=prompt_wav,
    duration=138,  # prompt_frames + gen_frames
    steps=16,
    cfg_strength=4.0,
    guidance_method="apg",
    seed=1024,
)
```
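If the returned waveform spans prompt plus continuation (which the `duration=prompt_frames + gen_frames` argument suggests, though this should be verified against the model's actual behavior), the prompt portion can be dropped before saving. A hypothetical helper under that assumption:

```python
import numpy as np

def trim_prompt(waveform, prompt_samples):
    """Drop the leading prompt portion from a cloned waveform.

    Assumes the model output covers prompt + continuation, as the
    duration = prompt_frames + gen_frames argument above suggests;
    verify against the actual model behavior before relying on this.
    """
    wav = np.asarray(waveform).squeeze()
    return wav[prompt_samples:]
```

Usage (variables from the voice-cloning example above): `sf.write("cloned.wav", trim_prompt(output.waveform.cpu().numpy(), prompt_wav.shape[-1]), 24000)`.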

License Agreement

This repository, including both the model weights and the source code, is released under the MIT License.

Any contributions to this repository are licensed under the MIT License, unless otherwise stated. This license does not grant any rights to use Meituan trademarks or patents.

For details, see the LICENSE file.
