Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
About
While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.
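The objective described above combines standard VAE terms with a semantic alignment regularizer on the latent space. The sketch below is a minimal, hypothetical illustration of one plausible form of that objective, not the released implementation: the function names (`cosine_alignment_loss`, `semantic_vae_loss`), the L1 reconstruction term, the cosine-similarity alignment form, and the weights `beta` and `lam` are all assumptions for illustration.

```python
import numpy as np

def cosine_alignment_loss(z_proj, sem, eps=1e-8):
    """1 - mean cosine similarity between projected latent frames and
    frame-level semantic features (e.g. from a pretrained speech SSL model).
    The cosine form is an assumption, not confirmed by the paper text."""
    num = np.sum(z_proj * sem, axis=-1)
    den = np.linalg.norm(z_proj, axis=-1) * np.linalg.norm(sem, axis=-1) + eps
    return float(np.mean(1.0 - num / den))

def semantic_vae_loss(x, x_hat, mu, logvar, z_proj, sem,
                      beta=1e-2, lam=1.0):
    """Hypothetical combined objective:
    reconstruction (L1) + beta * KL(q(z|x) || N(0, I)) + lam * alignment.
    `beta` and `lam` are illustrative weights, not values from the paper."""
    recon = np.mean(np.abs(x - x_hat))
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    align = cosine_alignment_loss(z_proj, sem)
    return recon + beta * kl + lam * align
```

The alignment term is what lets the latent space stay high-dimensional (good reconstruction and speaker similarity) while still being organized by semantic content (good intelligibility), which is the trade-off the abstract describes.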
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Processing | SUPERB | KWS Acc 0.4349 | 24 |
| Speech Reconstruction | LibriSpeech clean (test) | -- | 19 |
| Text-to-speech generation | LibriSpeech-PC (test-clean) | WER 2.01 | 10 |
| General Speech Representation Evaluation | Combined (LibriSpeech-clean, SUPERB, LibriSpeech-PC) (test) | Overall Score 69 | 10 |