NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models

About

While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Because speech encompasses several intertwined attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by this idea, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle the speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate the attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves quality on par with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.
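The idea of quantizing each attribute in its own subspace can be illustrated with a minimal sketch. It is not the paper's codec: it assumes that per-attribute latent vectors already exist (the encoders that would produce them are omitted), uses random codebooks in place of learned ones, and treats every attribute as a per-frame sequence. The names `nearest_code`, `quantize`, `factorized_quantize`, and the sizes `DIM`, `CODEBOOK_SIZE`, `NUM_FRAMES` are hypothetical and chosen only for illustration.

```python
import random

# Attribute subspaces named in the abstract.
ATTRIBUTES = ("content", "prosody", "timbre", "acoustic_details")


def nearest_code(vector, codebook):
    """Return the index of the codebook entry closest to `vector` (squared L2)."""
    def sq_dist(code):
        return sum((v - c) ** 2 for v, c in zip(vector, code))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))


def quantize(frames, codebook):
    """Plain nearest-neighbour vector quantization of a sequence of vectors."""
    indices = [nearest_code(frame, codebook) for frame in frames]
    return indices, [codebook[i] for i in indices]


def factorized_quantize(latents_by_attr, codebooks):
    """Quantize each attribute with its own codebook, independently of the others."""
    return {
        name: quantize(latents_by_attr[name], codebooks[name])
        for name in ATTRIBUTES
    }


def random_vectors(rng, count, dim):
    """Stand-in for learned codebooks or encoder outputs (assumption)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


# Illustrative sizes only; they are not taken from the paper.
DIM, CODEBOOK_SIZE, NUM_FRAMES = 4, 8, 5

rng = random.Random(0)
codebooks = {name: random_vectors(rng, CODEBOOK_SIZE, DIM) for name in ATTRIBUTES}
latents = {name: random_vectors(rng, NUM_FRAMES, DIM) for name in ATTRIBUTES}

tokens = factorized_quantize(latents, codebooks)
for name in ATTRIBUTES:
    indices, _ = tokens[name]
    print(name, indices)
```

Each attribute ends up with its own sequence of discrete token indices, which is the kind of per-subspace target a generative model could then predict separately.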

Zeqian Ju, Yuancheng Wang, Kai Shen, Xu Tan, Detai Xin, Dongchao Yang, Yanqing Liu, Yichong Leng, Kaitao Song, Siliang Tang, Zhizheng Wu, Tao Qin, Xiang-Yang Li, Wei Ye, Shikun Zhang, Jiang Bian, Lei He, Jinyu Li, Sheng Zhao • 2024

Related benchmarks

Task	Dataset	Result
Automatic Speech Recognition	LibriSpeech clean (test)	WER 8.2	1410
Text-to-Speech	LibriSpeech clean (test)	WER 1.81	97
Speech Reconstruction	LibriTTS clean (test)	PESQ 2.2532	67
Speech Reconstruction	LibriSpeech clean (test)	UTMOS Score 4.11	60
Speech Reconstruction	LibriTTS (test-other)	UTMOS 3.48	57
Speech Recognition	Switchboard	WER 25.5	37
Voice Conversion	VCTK	WER 0.7	27
Audio Reconstruction	LJSpeech	UTMOS 3.976	26
Speaker De-identification	ADReSS	AUC 74	20
Speech Resynthesis	LibriTTS (test-clean)	WER 3.49	17

