X-VC: Zero-shot Streaming Voice Conversion in Codec Space

About

Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Our audio samples, code and checkpoints are released at https://github.com/Jerrister/X-VC.
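The chunkwise inference with overlap smoothing mentioned in the abstract can be illustrated with a minimal sketch. This is not the X-VC implementation: the function names (`linear_crossfade`, `convert_chunkwise`), the linear ramp, and the assumption that the per-chunk converter returns as many frames as it receives are all hypothetical choices made for illustration, and the sketch operates on a plain 1-D sequence rather than codec latents.

```python
from typing import Callable, List, Sequence


def linear_crossfade(prev_tail: Sequence[float], next_head: Sequence[float]) -> List[float]:
    """Blend two equal-length overlapping segments with a linear ramp."""
    n = len(prev_tail)
    if n != len(next_head):
        raise ValueError("overlapping segments must have equal length")
    blended = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the newer chunk, rising across the overlap
        blended.append((1.0 - w) * prev_tail[i] + w * next_head[i])
    return blended


def convert_chunkwise(
    signal: Sequence[float],
    convert_chunk: Callable[[List[float]], List[float]],  # hypothetical per-chunk converter
    chunk_size: int,
    overlap: int,
) -> List[float]:
    """Process `signal` in overlapping chunks and smooth the seams.

    Assumes `convert_chunk` is length-preserving.
    """
    if not 0 <= overlap < chunk_size:
        raise ValueError("require 0 <= overlap < chunk_size")
    hop = chunk_size - overlap
    output: List[float] = []
    start = 0
    while start < len(signal):
        chunk = convert_chunk(list(signal[start:start + chunk_size]))
        if start == 0 or overlap == 0:
            output.extend(chunk)
        else:
            k = min(overlap, len(chunk))
            # The last k output frames cover the same positions as chunk[:k].
            output[-k:] = linear_crossfade(output[-k:], chunk[:k])
            output.extend(chunk[k:])
        if start + chunk_size >= len(signal):
            break
        start += hop
    return output
```

With an identity converter (`lambda c: c`), the output matches the input up to floating-point rounding, which is a quick sanity check that the seams line up. In a streaming setting, larger overlaps smooth chunk boundaries more but add redundant computation per hop.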

Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen • 2026

Related benchmarks

Task	Dataset	Result
Voice Conversion	Seed-TTS zh (test)	WER 1.99	9
Voice Conversion	Seed-TTS en (test)	WER 2.83	7
Cross-lingual Voice Conversion	Seed-TTS-Eval Chinese-to-English	WER 2.15	5
Cross-lingual Voice Conversion	Seed-TTS English-to-Chinese (Eval)	WER 2.67	4
Zero-shot Voice Conversion	Seed-TTS-Eval zh (test)	SMOS Score 3.89	3
Zero-shot Voice Conversion	Seed-TTS-Eval en (test)	SMOS 3.98	2
