Semantic-VAE: Semantic-Alignment Latent Representation for Better Speech Synthesis
About
While mel-spectrograms have been widely utilized as intermediate representations in zero-shot text-to-speech (TTS), their inherent redundancy leads to inefficiency in learning text-speech alignment. Compact VAE-based latent representations have recently emerged as a stronger alternative, but they also face a fundamental optimization dilemma: higher-dimensional latent spaces improve reconstruction quality and speaker similarity, but degrade intelligibility, while lower-dimensional spaces improve intelligibility at the expense of reconstruction fidelity. To overcome this dilemma, we propose Semantic-VAE, a novel VAE framework that utilizes semantic alignment regularization in the latent space. This design alleviates the reconstruction-generation trade-off by capturing semantic structure in high-dimensional latent representations. Extensive experiments demonstrate that Semantic-VAE significantly improves synthesis quality and training efficiency. When integrated into F5-TTS, our method achieves 2.10% WER and 0.64 speaker similarity on LibriSpeech-PC, outperforming mel-based systems (2.23%, 0.60) and vanilla acoustic VAE baselines (2.65%, 0.59). We also release the code and models to facilitate further research.
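The objective described above combines standard VAE terms with a semantic alignment regularizer on the latent space. The sketch below is a minimal, hypothetical illustration of one plausible form of that objective, not the released implementation: the function names (`cosine_alignment_loss`, `semantic_vae_loss`), the L1 reconstruction term, the cosine-similarity alignment form, and the weights `beta` and `lam` are all assumptions for illustration.

```python
import numpy as np

def cosine_alignment_loss(z_proj, sem, eps=1e-8):
    """1 - mean cosine similarity between projected latent frames and
    frame-level semantic features (e.g. from a pretrained speech SSL model).
    The cosine form is an assumption, not confirmed by the paper text."""
    num = np.sum(z_proj * sem, axis=-1)
    den = np.linalg.norm(z_proj, axis=-1) * np.linalg.norm(sem, axis=-1) + eps
    return float(np.mean(1.0 - num / den))

def semantic_vae_loss(x, x_hat, mu, logvar, z_proj, sem,
                      beta=1e-2, lam=1.0):
    """Hypothetical combined objective:
    reconstruction (L1) + beta * KL(q(z|x) || N(0, I)) + lam * alignment.
    `beta` and `lam` are illustrative weights, not values from the paper."""
    recon = np.mean(np.abs(x - x_hat))
    kl = -0.5 * np.mean(1.0 + logvar - mu**2 - np.exp(logvar))
    align = cosine_alignment_loss(z_proj, sem)
    return recon + beta * kl + lam * align
```

The alignment term is what lets the latent space stay high-dimensional (good reconstruction and speaker similarity) while still being organized by semantic content (good intelligibility), which is the trade-off the abstract describes.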
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Speech Processing | SUPERB | KWS Acc 0.4349 | 24 |
| Speech Reconstruction | LibriSpeech clean (test) | -- | 19 |
| Text-to-speech generation | LibriSpeech-PC (test-clean) | WER 2.01 | 10 |
| General Speech Representation Evaluation | Combined (LibriSpeech-clean, SUPERB, LibriSpeech-PC) (test) | Overall Score 69 | 10 |