NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers

About

Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important for capturing the diversity in human speech, such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use language models to generate these tokens one by one, an approach that suffers from unstable prosody, word skipping and repetition, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to obtain quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To strengthen the zero-shot capability needed for diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.
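
The abstract mentions residual vector quantizers without spelling out how they work. The sketch below is a toy illustration of the general technique only, not the codec used in the paper: each stage picks the nearest codebook entry to whatever the previous stages left unexplained, and the quantized latent vector is the sum of the chosen entries. The function names, the two-stage setup, and the codebook values are all assumptions made for the example.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def nearest(codebook: Sequence[Sequence[float]], x: Sequence[float]) -> int:
    """Return the index of the codebook entry closest to x (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], x)),
    )


def rvq_encode(
    x: Sequence[float], codebooks: Sequence[Sequence[Sequence[float]]]
) -> Tuple[List[int], Vector]:
    """Quantize x stage by stage; each stage quantizes the previous residual.

    Returns the chosen index per stage and the quantized latent vector,
    which is the sum of the selected codebook entries.
    """
    residual = list(x)
    quantized = [0.0] * len(x)
    indices: List[int] = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, codebook[idx])]
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices, quantized


# Hypothetical toy codebooks: a coarse first stage and a finer second stage.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0]],
    [[0.0, 0.0], [0.25, -0.25], [-0.25, 0.25]],
]

indices, latent = rvq_encode([1.2, 0.8], codebooks)
print(indices, latent)
```

In this toy setup the first stage captures the coarse shape of the input and the second stage refines it. A real codec learns its codebooks from data and typically uses more stages and much higher-dimensional vectors.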

Kai Shen, Zeqian Ju, Xu Tan, Yanqing Liu, Yichong Leng, Lei He, Tao Qin, Sheng Zhao, Jiang Bian • 2023

Related benchmarks

Task	Dataset	Result
Text-to-Speech	LibriSpeech clean (test)	WER 9	97
Text-to-Speech	Seed-ZH	CER 1.4	42
Text-to-Speech	Seed EN	WER 1.64	41
Text-to-Speech Synthesis	LibriTTS (CLEAN), LibriVox (NOISY), YouTube (WILD), and My Science Tutor (KIDS) (test)	MOS 2.05	21
Voice Conversion	LibriTTS (test-clean)	WER 2.94	11
Zero-shot Text-to-Speech	TTS tasks Zero-shot	UTMOS 2.38	10
Text-to-Speech	Text-to-Speech (TTS) Benchmark	WER 2.3	7
In-context Text-to-Speech	LibriSpeech clean (test)	Word Error Rate (WER) 2.6	6
Zero-shot Text-to-Speech	LibriSpeech SNR = ∞ (test-clean)	UTMOS 2.38	6
Zero-shot Text-to-Speech	LibriSpeech SNR = 12dB (test-clean)	UTMOS 1.66	6

Showing 10 of 17 rows
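
Several rows above report word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch of the standard definition: the word-level edit distance (substitutions, insertions, deletions) between a reference transcript and a recognized transcript, divided by the number of reference words. The function name is hypothetical, and the sketch does no text normalization. How each benchmark transcribed the synthesized audio and normalized the text is not stated on this page, so this does not reproduce any listed number.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion (reference word missing)
                    curr[j - 1] + 1,     # insertion (extra hypothesis word)
                    prev[j - 1] + cost,  # match or substitution
                )
            )
        prev = curr
    return prev[-1] / len(ref)


# One skipped word out of six reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sat on mat")
print(f"{100 * wer:.2f}%")
```

The function returns a fraction; multiply by 100 for a percentage. Character error rate (CER) follows the same recipe with characters in place of words.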
