Zero-shot Voice Conversion with Diffusion Transformers

About

Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines such as OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in performance comparable to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.

Songting Liu • 2024

Related benchmarks

Task	Dataset	Result
Voice Conversion	LibriSpeech English (test)	Speaker Similarity 0.63	20
Pitch Style Conversion	VocalSet and GTSinger	nMOS 3.927	18
Text-to-Speech	Seed-TTS English (test)	WER 2.57	14
Whisper-to-Normal speech conversion	WTIMIT English (test)	UTMOS 3.321	12
Voice Conversion	LibriTTS (test-clean)	WER 2.51	11
Singing Voice Conversion	SVC GT Leading (test)	Speaker Similarity 0.801	10
Zero-shot Voice Imitation	SeedTTS vc-en (test)	UTMOS 2.94	10
Voice Conversion	Seed-TTS zh (test)	WER 1.79	9
Voice Conversion	SeedTTS VC English (test)	WER 2.97	8
Voice Conversion	SeedTTS VC Chinese (test)	WER 2.45	8

Showing 10 of 31 rows
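
Two of the metrics in the table, WER and Speaker Similarity, can be illustrated with a small sketch. This is not the evaluation code behind the reported numbers. It assumes WER is the word-level edit distance divided by the reference length (the table values are presumably percentages), and that speaker similarity is the cosine similarity between two speaker embeddings. The function names are hypothetical, and the embeddings here are plain lists of floats standing in for the output of whatever speaker-verification model an evaluation would use.

```python
import math


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


def cosine_similarity(a, b) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(round(word_error_rate("the cat sat on the mat", "the cat sat on mat"), 4))  # 0.1667
    # Toy 2-dimensional "embeddings" pointing in the same direction.
    print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
```

In practice the hypothesis transcript would come from an ASR model run on the converted speech, and the two vectors from a speaker encoder applied to the converted and reference utterances. Which models are used varies by benchmark and is not stated on this page.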
