KALL-E: Autoregressive Speech Synthesis with Next-Distribution Prediction

About

We introduce KALL-E, a novel autoregressive (AR) language model for text-to-speech (TTS) synthesis that operates by predicting the next distribution of continuous speech frames. Unlike existing methods, KALL-E directly models the continuous speech distribution conditioned on text, eliminating the need for any diffusion-based components. Specifically, we utilize a Flow-VAE to extract a continuous latent speech representation from waveforms, instead of relying on discrete speech tokens. A single AR Transformer is then trained to predict these continuous speech distributions from text, optimizing a Kullback-Leibler divergence loss as its objective. Experimental results demonstrate that KALL-E achieves superior speech synthesis quality and can even adapt to a target speaker from just a single sample. Importantly, KALL-E provides a more direct and effective approach for utilizing continuous speech representations in TTS.
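The Kullback-Leibler objective mentioned above can be illustrated with a minimal sketch. It assumes, purely for illustration, that both the Flow-VAE posterior for a frame and the distribution predicted by the AR Transformer are diagonal Gaussians parameterized by a mean and a log-variance; the abstract does not state the parameterization or the direction of the divergence, and the function name `gaussian_kl` and the tensor shapes are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between diagonal Gaussians, summed over the latent dimension.

    Assumption: q is the target (e.g. VAE posterior) and p is the model's
    predicted next-frame distribution. The real system may differ.
    """
    var_q = np.exp(logvar_q)
    var_p = np.exp(logvar_p)
    # Closed-form KL per dimension for univariate Gaussians.
    kl = 0.5 * (logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0)
    return kl.sum(axis=-1)

# Hypothetical shapes: 4 speech frames, 8 latent dimensions.
rng = np.random.default_rng(0)
mu_q = rng.normal(size=(4, 8))
logvar_q = 0.1 * rng.normal(size=(4, 8))

# Identical distributions give zero loss.
loss_same = gaussian_kl(mu_q, logvar_q, mu_q, logvar_q).mean()

# A prediction of a standard normal for every frame gives a positive loss.
loss_diff = gaussian_kl(mu_q, logvar_q, np.zeros((4, 8)), np.zeros((4, 8))).mean()
```

In a training loop, a per-frame loss of this kind would be averaged over frames and minimized with respect to the predicted parameters, which is one way to train on continuous latents without a diffusion component.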

Kangxiang Xia, Xinfa Zhu, Jixun Yao, Wenjie Tian, Wenhao Li, Lei Xie • 2024

Related benchmarks

Task	Dataset	Result	Rank
Text-to-Speech	Seed-TTS en (test)	WER 1.94	121
Text-to-Speech	Seed-TTS zh (test)	WER 0.96	87
