Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
About
We introduce a language modeling approach for text-to-speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, hundreds of times larger than existing systems. VALL-E exhibits in-context learning capabilities and can synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experimental results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in speech naturalness and speaker similarity. In addition, we find that VALL-E can preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.
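The key idea above is that synthesis is framed as next-token prediction over discrete codec codes, conditioned on the text and on codec tokens extracted from a short enrolled recording. The sketch below illustrates that inference loop in miniature; the function names (`predict_next`, `synthesize`), the toy deterministic predictor, and all token values are illustrative assumptions, not the released VALL-E API.

```python
# Minimal sketch of codec-LM-style TTS inference, assuming a token-level
# autoregressive model. A real system would replace predict_next with a
# trained Transformer returning logits over the codec vocabulary.

def predict_next(context, vocab_size=1024, eos=0):
    """Toy stand-in for the neural codec language model."""
    if len(context) >= 16:  # pretend the model eventually emits end-of-sequence
        return eos
    # Deterministic dummy "prediction" in [1, vocab_size - 1].
    return (sum(context) * 31 + 7) % (vocab_size - 1) + 1

def synthesize(phoneme_ids, prompt_codec_ids, eos=0, max_len=64):
    """Generate codec tokens conditioned on text tokens plus the acoustic
    prompt (codec tokens of the 3-second enrolled recording)."""
    context = list(phoneme_ids) + list(prompt_codec_ids)
    generated = []
    while len(generated) < max_len:
        tok = predict_next(context + generated)
        if tok == eos:
            break
        generated.append(tok)
    # The generated tokens would be decoded back to a waveform by the
    # neural codec's decoder; here we just return them.
    return generated
```

Because generation is conditioned on the prompt tokens, speaker identity (and, per the abstract, emotion and acoustic environment) is carried implicitly through the context rather than through an explicit speaker embedding.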
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Text-to-Speech | LibriSpeech clean (test) | WER | 19 | 50 |
| Cross-sentence Zero-shot Speech Synthesis | LibriSpeech clean (test) | WER-H | 5.9 | 16 |
| Continuation Zero-shot Speech Synthesis | LibriSpeech (test-clean) | WER-H | 3.8 | 15 |
| Text-to-Speech | VCTK | WER | 7.9 | 13 |
| Voice Conversion | LibriTTS (test-clean) | WER | 2.77 | 11 |
| Text-to-Speech Synthesis | 10-second speech segments | Inference Time (s) | 7.32 | 8 |
| Text-to-Speech | Text-to-Speech (TTS) Benchmark | WER | 5.9 | 7 |
| Lyrics-to-vocals | Evaluation set without audio prompt (test) | Musicality | 3.15 | 7 |
| Zero-shot Text-to-Speech | LibriSpeech SNR = ∞ (test-clean) | UTMOS | 3.68 | 6 |
| Text-to-Speech | VCTK zero-shot | WER | 7.9 | 6 |