StyleTTS 2: Towards Human-Level Text-to-Speech through Style Diffusion and Adversarial Training with Large Speech Language Models

About

In this paper, we present StyleTTS 2, a text-to-speech (TTS) model that leverages style diffusion and adversarial training with large speech language models (SLMs) to achieve human-level TTS synthesis. StyleTTS 2 differs from its predecessor by modeling styles as a latent random variable through diffusion models to generate the most suitable style for the text without requiring reference speech, achieving efficient latent diffusion while benefiting from the diverse speech synthesis offered by diffusion models. Furthermore, we employ large pre-trained SLMs, such as WavLM, as discriminators with our novel differentiable duration modeling for end-to-end training, resulting in improved speech naturalness. StyleTTS 2 surpasses human recordings on the single-speaker LJSpeech dataset and matches them on the multispeaker VCTK dataset, as judged by native English speakers. Moreover, when trained on the LibriTTS dataset, our model outperforms previous publicly available models for zero-shot speaker adaptation. This work achieves the first human-level TTS on both single-speaker and multispeaker datasets, showcasing the potential of style diffusion and adversarial training with large SLMs. The audio demos and source code are available at https://styletts2.github.io/.

Yinghao Aaron Li, Cong Han, Vinay S. Raghavan, Gavin Mischler, Nima Mesgarani • 2023

Related benchmarks

Task	Dataset	Result
Text-to-Speech	LibriTTS clean (test)	--	37
Text-to-Speech	LJSpeech (test)	CMOS 81.48	25
Text-to-Speech Synthesis	LibriTTS (CLEAN), LibriVox (NOISY), YouTube (WILD), and My Science Tutor (KIDS) (test)	MOS 3.01	21
Speech Synthesis	LJSpeech	MOS 3.83	12
Singing Voice Style Transfer	Multi-singer unseen (M4Singer, OpenSinger, PopBuTFy, GTSinger) (test)	MOS-Q 3.71	12
Binaural Speech Generation	MRSDrama complete drama	IPD MAE 0.011	9
Text-to-Speech	Difficult-Word Benchmark English 500 words	D-PER 0.65	9
Speaker Erasure	LibriTTS 1-speaker setting (retain test)	WER 2.75	7
Speaker Erasure	LibriTTS 1-speaker setting (forget test)	WER 2.72	7
Monaural Speech Synthesis	MRSDrama monaural, single-sentence	CER 4.19	7

Showing 10 of 33 rows
