Zero-shot Voice Conversion with Diffusion Transformers
About
Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Whisper-to-Normal speech conversion | WTIMIT English (test) | UTMOS | 3.321 | 12 |
| Voice Conversion | LibriTTS (test-clean) | WER | 2.51 | 11 |
| Singing Voice Conversion | SVC GT Leading (test) | Speaker Similarity | 0.801 | 10 |
| Zero-shot Voice Imitation | SeedTTS vc-en (test) | UTMOS | 2.94 | 10 |
| Voice Conversion | Seed-TTS zh (test) | WER | 1.79 | 9 |
| Voice Conversion | SeedTTS VC English (test) | WER | 2.97 | 8 |
| Voice Conversion | SeedTTS VC Chinese (test) | WER | 2.45 | 8 |
| Singing Voice Conversion | SVC English | WER | 22.86 | 8 |
| Singing Voice Conversion | Chinese SVC | WER | 15.65 | 8 |
| Whisper-to-normal conversion | AISHELL6 Whisper (test) | DNSMOS Overall Score | 2.868 | 8 |
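Two of the metrics reported here, word error rate (WER) and speaker similarity, can be illustrated with a minimal sketch. It assumes the common definitions: WER as word-level edit distance divided by the reference length, and speaker similarity as the cosine similarity between two speaker embeddings. The function names and toy inputs are hypothetical, and the page does not say which speech recognizer or speaker-embedding model produced the scores, so this is not the evaluation pipeline behind the table.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy transcript pair: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Toy 2-dimensional "embeddings"; real speaker embeddings are much larger.
    print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))
```

In practice the hypothesis transcript would come from running a speech recognizer on the converted audio, and the embeddings from a speaker-verification model applied to the converted and reference utterances.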