YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone

About

YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art voice similarity with reasonable quality. This is important for synthesizing speakers whose voice or recording characteristics differ greatly from those seen during training.

Edresson Casanova, Julian Weber, Christopher Shulby, Arnaldo Candido Junior, Eren Gölge, Moacir Antonelli Ponti • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech clean (test) | WER 20.4 | 1156 |
| Automatic Speech Recognition | LibriSpeech (test-other) | WER 51.2 | 1151 |
| Speaker Impersonation Attack | VoxCeleb1 and CNCeleb subsampled (test-enrollment) | ASR Accuracy 100 | 36 |
| Response Ranking | ParaS2SBench | Age Score 4.41 | 16 |
| Zero-shot Text-to-Speech | MLS En filtered (test) | WER 0.07 | 15 |
| Zero-shot Text-to-Speech | MLS Fr filtered (test) | WER 10.7 | 15 |
| Zero-shot Text-to-Speech | MLS Pt filtered (test) | WER 13.1 | 15 |
| Voice Conversion | Librispeech (test-clean) | WER 11.93 | 13 |
| Text-to-Speech | Filtered MLS English (test) | SMOS 3.48 | 12 |
| Speech Role-Playing | ActorMindBench | Phoebe Score 2.9 | 7 |
Showing 10 of 32 rows
