M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

About

Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, which constrains naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on the multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text and speech sequences without requiring pseudo-alignment. Single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a mel-VAE codec that provides a 3× training acceleration. Experimental results on the Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores. Code and demos will be available at https://wwwwxp.github.io/M3-TTS.

Xiaopeng Wang, Chunyu Qiang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Yukun Liu, Yuzhe Liang, Kang Yin, Yuankun Xie, Heng Xie, Chenxing Li, Chen Zhang, Changsheng Li • 2025
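
The word error rates quoted in the abstract are, by the usual definition, a word-level edit distance divided by the number of reference words. The sketch below illustrates that definition only. It is not the authors' evaluation pipeline: the function name `word_error_rate` is hypothetical, the transcripts are assumed to come from some external speech recognizer, and text normalization is left out. Chinese text is commonly scored per character rather than per space-separated word, which this sketch does not handle.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Hypothetical helper for illustration; whitespace tokenization is an
    assumption and no text normalization is applied.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

With this definition, a reported value of 1.36% corresponds to a return value of 0.0136 when errors and reference words are accumulated over a test set.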

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS 24 kHz (test-zh)	SIM-o 0.762	11
Text-to-Speech	Seed-TTS en 24 kHz (test)	SIM-o 0.681	11
Text-to-Speech	English TTS Evaluation (EN) (test)	SIM-o 0.604	8
Text-to-Speech	Chinese TTS Evaluation ZH (test)	SIM-o 62.1	8
Text-to-Speech	AISHELL3 44.1 kHz (test)	SIM-o 0.54	3
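
SIM-o in the table is a speaker-similarity score. In zero-shot TTS evaluations it is typically the cosine similarity between a speaker embedding of the synthesized audio and one of the original reference prompt. The sketch below shows only that final cosine step, under the assumption that the embeddings have already been produced by some speaker-verification model. The function name `cosine_similarity` and the toy vectors are illustrative and do not come from the paper.

```python
import math
from typing import Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two embedding vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
prompt_embedding = [0.2, 0.9, 0.1]
synthesized_embedding = [0.25, 0.8, 0.2]
similarity = cosine_similarity(prompt_embedding, synthesized_embedding)
```

Scores of this kind lie in [-1, 1], with higher values meaning the synthesized voice is closer to the prompt speaker. The 62.1 entry in the table appears to be reported on a 0 to 100 scale rather than as a fraction.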
