LongCat-AudioDiT: High-Fidelity Diffusion Text-to-Speech in the Waveform Latent Space

About

We present LongCat-AudioDiT, a novel, non-autoregressive diffusion-based text-to-speech (TTS) model that achieves state-of-the-art (SOTA) performance. Unlike previous methods that rely on intermediate acoustic representations such as mel-spectrograms, the core innovation of LongCat-AudioDiT lies in operating directly within the waveform latent space. This approach effectively mitigates compounding errors and drastically simplifies the TTS pipeline, requiring only a waveform variational autoencoder (Wav-VAE) and a diffusion backbone. Furthermore, we introduce two critical improvements to the inference process: first, we identify and rectify a long-standing training-inference mismatch; second, we replace traditional classifier-free guidance with adaptive projection guidance to elevate generation quality. Experimental results demonstrate that, despite the absence of complex multi-stage training pipelines or high-quality human-annotated datasets, LongCat-AudioDiT achieves SOTA zero-shot voice cloning performance on the Seed benchmark while maintaining competitive intelligibility. Specifically, our largest variant, LongCat-AudioDiT-3.5B, outperforms the previous SOTA model (Seed-TTS), improving the speaker similarity (SIM) scores from 0.809 to 0.818 on Seed-ZH, and from 0.776 to 0.797 on Seed-Hard. Finally, through comprehensive ablation studies and systematic analysis, we validate the effectiveness of our proposed modules. Notably, we investigate the interplay between the Wav-VAE and the TTS backbone, revealing the counterintuitive finding that superior reconstruction fidelity in the Wav-VAE does not necessarily lead to better overall TTS performance. Code and model weights are released to foster further research within the speech community.
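The abstract's switch from classifier-free guidance (CFG) to adaptive projection guidance (APG) can be illustrated with a minimal sketch. It assumes the commonly described APG formulation: the guidance update is split into components parallel and orthogonal to the conditional prediction, and the parallel part is down-weighted. The function names, the `eta` and `norm_threshold` parameters, and the flat-list representation are illustrative assumptions, not the authors' released code; the momentum term used in some APG variants is omitted.

```python
import math

def cfg(cond, uncond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]

def apg(cond, uncond, scale, eta=0.0, norm_threshold=None):
    """Sketch of adaptive projection guidance on flat lists of floats.

    eta: weight on the component of the update parallel to `cond`
         (hypothetical parameter name; eta=1 with no threshold recovers CFG).
    norm_threshold: optional cap on the L2 norm of the guidance update.
    """
    diff = [c - u for c, u in zip(cond, uncond)]

    # Optionally rescale the update so that its norm does not exceed the threshold.
    if norm_threshold is not None:
        norm = math.sqrt(sum(d * d for d in diff))
        if norm > norm_threshold:
            diff = [d * norm_threshold / norm for d in diff]

    # Project the update onto the conditional prediction.
    cond_sq = sum(c * c for c in cond)
    coeff = sum(d * c for d, c in zip(diff, cond)) / cond_sq if cond_sq > 0 else 0.0
    parallel = [coeff * c for c in cond]
    orthogonal = [d - p for d, p in zip(diff, parallel)]

    # Keep the orthogonal part in full and down-weight the parallel part.
    return [c + (scale - 1.0) * (o + eta * p)
            for c, o, p in zip(cond, orthogonal, parallel)]

# Toy predictions standing in for one denoising step of the diffusion backbone.
cond_pred = [1.0, 2.0, -0.5]
uncond_pred = [0.5, 1.0, 1.0]
guided = apg(cond_pred, uncond_pred, scale=3.0, eta=0.0)
```

With `eta=0.0` the guided prediction moves only in directions orthogonal to the conditional prediction, which is the property usually credited with reducing the oversaturation that large CFG scales can cause.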

Detai Xin, Shujie Hu, Chengzuo Yang, Chen Huang, Guoqiao Yu, Guanglu Wan, Xunliang Cai • 2026

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS en (test)	WER 1.94	159
Text-to-Speech	Seed-TTS zh (test)	--	87
Speech Reconstruction	LibriTTS clean (test)	PESQ 3.237	67
Voice Cloning	Seed-TTS en (test)	WER 1.5	53
Text-to-Speech	Seed-ZH	CER 1.09	42
Text-to-Speech	Seed EN	WER 1.5	41
Voice Cloning	Seed-TTS-Eval zh (test)	CER 1.09	37
Voice Cloning	Seed-TTS-Eval zh-hard (test)	CER 6.04	18
Text-to-Speech	Seed ZH-Hard	CER 6.04	15
Voice Cloning	Subjective Evaluation Dataset	N-MOS 4.63	5
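The WER and CER figures in the table are edit-distance error rates. A minimal sketch of the arithmetic follows, assuming the values are percentages and that the hypothesis text comes from an ASR transcription of the synthesized audio. The helper names are illustrative, and the benchmark's actual text normalization and ASR models are not shown.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]

def wer(reference, hypothesis):
    """Word error rate in percent (whitespace-tokenized)."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    """Character error rate in percent (spaces ignored)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    return 100.0 * edit_distance(ref_chars, hyp_chars) / len(ref_chars)

# One dropped word out of six reference words.
english_wer = wer("the cat sat on the mat", "the cat sat on mat")
# One dropped character out of four reference characters.
chinese_cer = cer("你好世界", "你好世")
```

CER is typically used for Chinese test sets because the text has no whitespace word boundaries, which matches the table's use of CER for the zh rows and WER for the en rows.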
