
M3-TTS: Multi-modal DiT Alignment & Mel-latent for Zero-shot High-fidelity Speech Synthesis

About

Non-autoregressive (NAR) text-to-speech synthesis relies on length alignment between text sequences and audio representations, which constrains naturalness and expressiveness. Existing methods depend on duration modeling or pseudo-alignment strategies that severely limit naturalness and computational efficiency. We propose M3-TTS, a concise and efficient NAR TTS paradigm based on the multi-modal diffusion transformer (MM-DiT) architecture. M3-TTS employs joint diffusion transformer layers for cross-modal alignment, achieving stable monotonic alignment between variable-length text and speech sequences without pseudo-alignment; single diffusion transformer layers further enhance acoustic detail modeling. The framework integrates a mel-VAE codec that provides a 3× training speedup. Experimental results on the Seed-TTS and AISHELL-3 benchmarks demonstrate that M3-TTS achieves state-of-the-art NAR performance with the lowest word error rates (1.36% English, 1.31% Chinese) while maintaining competitive naturalness scores. Code and demos will be available at https://wwwwxp.github.io/M3-TTS.
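The core idea of joint diffusion transformer layers is that text tokens and speech (mel-latent) tokens of different lengths are concatenated and attend to each other in a single attention operation, so no explicit duration predictor or pseudo-alignment is needed. A minimal NumPy sketch of such a joint attention step (single head, illustrative weights only; the function and variable names are our own assumptions, not the paper's implementation):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def joint_attention(text, speech, w_q, w_k, w_v):
    """Concatenate text and speech tokens, attend jointly, split back.

    Both modalities share one attention map, so alignment between the
    variable-length sequences emerges from attention rather than from
    a duration model.
    """
    x = np.concatenate([text, speech], axis=0)          # (T_txt + T_sp, d)
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    d = q.shape[-1]
    attn = softmax(q @ k.T / np.sqrt(d), axis=-1)       # joint attention map
    out = attn @ v
    return out[: len(text)], out[len(text):]            # split per modality

rng = np.random.default_rng(0)
d = 16
text = rng.normal(size=(7, d))     # variable-length text sequence
speech = rng.normal(size=(40, d))  # variable-length mel-latent sequence
w_q, w_k, w_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))
t_out, s_out = joint_attention(text, speech, w_q, w_k, w_v)
print(t_out.shape, s_out.shape)    # each modality keeps its own length
```

In a real MM-DiT block this would use per-modality projections, multiple heads, and diffusion-timestep conditioning; the sketch only shows why no length matching between the two sequences is required.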

Xiaopeng Wang, Chunyu Qiang, Ruibo Fu, Zhengqi Wen, Xuefei Liu, Yukun Liu, Yuzhe Liang, Kang Yin, Yuankun Xie, Heng Xie, Chenxing Li, Chen Zhang, Changsheng Li • 2025

Related benchmarks

Task            Dataset                             Metric  Result  Rank
Text-to-Speech  Seed-TTS 24 kHz (test-zh)           SIM-o   0.762   11
Text-to-Speech  Seed-TTS en 24 kHz (test)           SIM-o   0.681   11
Text-to-Speech  English TTS Evaluation (EN) (test)  SIM-o   0.604   8
Text-to-Speech  Chinese TTS Evaluation ZH (test)    SIM-o   62.1    8
Text-to-Speech  AISHELL3 44.1 kHz (test)            SIM-o   0.54    3
