Zero-shot Voice Conversion with Diffusion Transformers

About

Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference utterance from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines such as OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, achieving performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.
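
The two evaluation metrics named in the abstract, word error rate and speaker similarity, can be illustrated with a minimal sketch. It assumes the usual definitions: word error rate as the word-level edit distance divided by the reference length, and speaker similarity as the cosine similarity between two speaker embeddings. The function names and toy inputs are hypothetical. In practice the hypothesis transcript would come from a speech recognizer and the embeddings from a speaker verification model, and this page does not say which models the authors used.

```python
from math import sqrt


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j]: edit distance between the reference words seen so far
    # (initially none) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = sqrt(sum(x * x for x in a))
    norm_b = sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # Toy transcript pair: one substitution and one deletion over six words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # Placeholder embeddings standing in for real speaker-model outputs.
    print(cosine_similarity([0.2, 0.9, 0.1], [0.25, 0.8, 0.05]))
```

A lower word error rate indicates that the converted speech stays intelligible, while a higher cosine similarity indicates that its timbre is closer to the reference speaker.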

Songting Liu • 2024

Related benchmarks

Task                                 Dataset                    Result                      Rank
Whisper-to-Normal speech conversion  WTIMIT English (test)      UTMOS 3.321                 12
Voice Conversion                     LibriTTS (test-clean)      WER 2.51                    11
Singing Voice Conversion             SVC GT Leading (test)      Speaker Similarity 0.801    10
Zero-shot Voice Imitation            SeedTTS vc-en (test)       UTMOS 2.94                  10
Voice Conversion                     Seed-TTS zh (test)         WER 1.79                    9
Voice Conversion                     SeedTTS VC English (test)  WER 2.97                    8
Voice Conversion                     SeedTTS VC Chinese (test)  WER 2.45                    8
Singing Voice Conversion             SVC English                WER 22.86                   8
Singing Voice Conversion             Chinese SVC                WER 15.65                   8
Whisper-to-normal conversion         AISHELL6 Whisper (test)    DNSMOS Overall Score 2.868  8
Showing 10 of 23 rows
