
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

About

We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called Vall-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. Vall-E exhibits emergent in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that Vall-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find that Vall-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
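
The phrase "conditional language modeling task" in the abstract can be made concrete with a generic autoregressive factorization. This is only an illustrative sketch: the symbols ($x$ for the text input, $\tilde{C}$ for the codec codes of the 3-second acoustic prompt, $c_t$ for the $t$-th discrete code of the target speech, $\theta$ for the model parameters) are assumptions introduced here, not notation taken from this page, and the model's actual factorization across codec quantizer levels may differ.

```latex
% Generic conditional language model over discrete codec codes (illustrative).
% x        : text input (assumed symbol)
% \tilde{C}: codec codes of the acoustic prompt (assumed symbol)
% c_t      : t-th discrete code of the target utterance, t = 1, ..., T
\[
  p\bigl(C \mid x, \tilde{C}; \theta\bigr)
  = \prod_{t=1}^{T} p\bigl(c_t \mid c_{<t},\, x,\, \tilde{C};\, \theta\bigr)
\]
% Training then minimizes the negative log-likelihood of the observed codes:
\[
  \mathcal{L}(\theta)
  = -\sum_{t=1}^{T} \log p\bigl(c_t \mid c_{<t},\, x,\, \tilde{C};\, \theta\bigr)
\]
```

Under this view, synthesis amounts to sampling or decoding a code sequence conditioned on the text and the prompt, then converting the codes back to a waveform with the codec decoder, in contrast to regressing a continuous signal directly.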

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, Lei He, Sheng Zhao, Furu Wei • 2023

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Speech | LibriSpeech clean (test) | WER 19 | 66
Text-to-Speech Synthesis | LibriTTS (CLEAN), LibriVox (NOISY), YouTube (WILD), and My Science Tutor (KIDS) (test) | MOS 2.9 | 21
Text-to-Speech | VCTK | WER 7.9 | 19
Cross-sentence Zero-shot Speech Synthesis | LibriSpeech clean (test) | WERH 5.9 | 16
Continuation Zero-shot Speech Synthesis | Librispeech (test-clean) | WERH 3.8 | 15
Voice Conversion | LibriTTS (test-clean) | WER 2.77 | 11
Text-to-Speech Synthesis | 10-second speech segments | Inference Time (s) 7.32 | 8
Text-to-Speech | Text-to-Speech (TTS) Benchmark | WER 5.9 | 7
Lyrics-to-vocals | Evaluation set without audio prompt (test) | Musicality 3.15 | 7
Zero-shot Text-to-Speech | LibriSpeech SNR = ∞ (test-clean) | UTMOS 3.68 | 6
Showing 10 of 22 rows
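
Several rows above report word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the sketch makes no claim about the text normalization or the speech recognizer used to produce the listed results; for TTS, the hypothesis is typically an automatic transcription of the synthesized audio.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Illustrative only: real evaluations usually normalize case and
    punctuation before scoring, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of four reference words.
    print(word_error_rate("a b c d", "a x c d"))
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.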
