YourTTS: Towards Zero-Shot Multi-Speaker TTS and Zero-Shot Voice Conversion for everyone
About
YourTTS brings the power of a multilingual approach to the task of zero-shot multi-speaker TTS. Our method builds upon the VITS model and adds several novel modifications for zero-shot multi-speaker and multilingual training. We achieved state-of-the-art (SOTA) results in zero-shot multi-speaker TTS and results comparable to SOTA in zero-shot voice conversion on the VCTK dataset. Additionally, our approach achieves promising results in a target language with a single-speaker dataset, opening possibilities for zero-shot multi-speaker TTS and zero-shot voice conversion systems in low-resource languages. Finally, it is possible to fine-tune the YourTTS model with less than 1 minute of speech and achieve state-of-the-art voice similarity with reasonable quality. This is important for enabling synthesis for speakers whose voice or recording characteristics differ greatly from those seen during training.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Automatic Speech Recognition | LibriSpeech (test-other) | WER | 51.2 | 966 |
| Automatic Speech Recognition | LibriSpeech clean (test) | WER | 20.4 | 833 |
| Zero-shot Text-to-Speech | MLS En filtered (test) | WER | 0.07 | 15 |
| Zero-shot Text-to-Speech | MLS Fr filtered (test) | WER | 10.7 | 15 |
| Zero-shot Text-to-Speech | MLS Pt filtered (test) | WER | 13.1 | 15 |
| Voice Conversion | LibriSpeech (test-clean) | WER | 11.93 | 13 |
| Text-to-Speech | Filtered MLS English (test) | SMOS | 3.48 | 12 |
| Diverse Speech Generation | LibriSpeech (test-other) | WER | 9 | 7 |
| Speech-to-speech translation | CVSS Fr-En | BLEU | 16.23 | 7 |
| Speech-to-speech translation | CVSS Es-En | BLEU | 21.09 | 7 |
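
Most rows in the table above report word error rate (WER). As a reading aid, the following is a minimal sketch of how WER is commonly computed: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The function name `word_error_rate`, the whitespace tokenization, and the percentage scaling are assumptions made for illustration; the table does not state which text normalization or scaling each benchmark uses, so this may not reproduce the reported numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage (assumed scaling, hypothetical helper)."""
    # Assumption: plain whitespace tokenization, no text normalization.
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

For TTS and voice conversion rows, a WER of this kind is typically obtained by transcribing the synthesized audio with a speech recognizer and comparing against the input text, so lower values suggest more intelligible output.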