
Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis

About

While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.
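The core idea, a VAE whose training objective adds a semantic-alignment term on top of the usual reconstruction and KL losses, can be sketched as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the loss weights (`beta`, `lam`), the L1 reconstruction term, the cosine-similarity form of the alignment term, and the use of a frame-aligned pretrained semantic encoder are all assumptions made here for clarity.

```python
import numpy as np

def vae_loss_with_semantic_alignment(recon, target, mu, logvar,
                                     z_proj, sem_emb,
                                     beta=1e-2, lam=1.0):
    """Illustrative composite loss: reconstruction + KL + semantic alignment.

    recon, target: (T, D) decoded and reference acoustic frames
    mu, logvar:    (T, Z) Gaussian posterior parameters of the VAE encoder
    z_proj:        (T, S) latents projected into a semantic space
    sem_emb:       (T, S) frame-aligned semantic embeddings, e.g. from a
                   pretrained speech/text encoder (an assumption here)
    beta, lam:     loss weights (hypothetical values, not from the paper)
    """
    # L1 reconstruction loss over acoustic frames
    recon_loss = np.abs(recon - target).mean()
    # KL divergence of N(mu, exp(logvar)) to the standard normal prior
    kl = 0.5 * (np.exp(logvar) + mu**2 - 1.0 - logvar).sum(axis=1).mean()
    # Alignment term: 1 - cosine similarity pulls the high-dimensional
    # latents toward the semantic structure, which is the mechanism the
    # abstract credits for easing the reconstruction-generation trade-off
    num = (z_proj * sem_emb).sum(axis=1)
    den = (np.linalg.norm(z_proj, axis=1)
           * np.linalg.norm(sem_emb, axis=1) + 1e-8)
    align = (1.0 - num / den).mean()
    return recon_loss + beta * kl + lam * align
```

With perfectly reconstructed frames, a posterior matching the prior, and latents already parallel to the semantic embeddings, all three terms vanish; any misalignment between `z_proj` and `sem_emb` raises the loss even when reconstruction is perfect, which is what distinguishes this objective from a vanilla acoustic VAE.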

Zhikang Niu, Shujie Hu, Jeongsoo Choi, Yushen Chen, Peining Chen, Pengcheng Zhu, Yunting Yang, Bowen Zhang, Jian Zhao, Chunhui Wang, Xie Chen · 2025

Related benchmarks

Task | Dataset | Result | Rank
Speech Processing | SUPERB | KWS Acc: 0.4349 | 24
Speech Reconstruction | LibriSpeech clean (test) | -- | 19
Text-to-speech generation | LibriSpeech-PC (test-clean) | WER: 2.01 | 10
General Speech Representation Evaluation | Combined (LibriSpeech-clean, SUPERB, LibriSpeech-PC) (test) | Overall Score: 69 | 10
