X-VC: Zero-shot Streaming Voice Conversion in Codec Space

About

Zero-shot voice conversion (VC) aims to convert a source utterance into the voice of an unseen target speaker while preserving its linguistic content. Although recent systems have improved conversion quality, building zero-shot VC systems for interactive scenarios remains challenging because high-fidelity speaker transfer and low-latency streaming inference are difficult to achieve simultaneously. In this work, we present X-VC, a zero-shot streaming VC system that performs one-step conversion in the latent space of a pretrained neural codec. X-VC uses a dual-conditioning acoustic converter that jointly models source codec latents and frame-level acoustic conditions derived from target reference speech, while injecting utterance-level target speaker information through adaptive normalization. To reduce the mismatch between training and inference, we train the model with generated paired data and a role-assignment strategy that combines standard, reconstruction, and reversed modes. For streaming inference, we further adopt a chunkwise inference scheme with overlap smoothing that is aligned with the segment-based training paradigm of the codec. Experiments on Seed-TTS-Eval show that X-VC achieves the best streaming WER in both English and Chinese, strong speaker similarity in same-language and cross-lingual settings, and substantially lower offline real-time factor than the compared baselines. These results suggest that codec-space one-step conversion is a practical approach for building high-quality low-latency zero-shot VC systems. Audio samples are available at https://x-vc.github.io. Our code and checkpoints will also be released.
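The chunkwise inference with overlap smoothing mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not specify the chunk size, the overlap length, or the smoothing window, so the linear cross-fade, the function name `crossfade_chunks`, and the `overlap` parameter below are assumptions for illustration, not the X-VC implementation.

```python
from typing import List, Sequence


def crossfade_chunks(chunks: Sequence[Sequence[float]], overlap: int) -> List[float]:
    """Join chunks whose last/first `overlap` samples cover the same time span.

    The overlapping region is blended with a linear cross-fade (an assumed
    smoothing window, not necessarily the one used by X-VC).
    """
    if not chunks:
        return []
    out = list(chunks[0])
    for chunk in chunks[1:]:
        if overlap <= 0:
            # No overlap: plain concatenation.
            out.extend(chunk)
            continue
        if len(chunk) < overlap or len(out) < overlap:
            raise ValueError("every chunk must be at least `overlap` samples long")
        start = len(out) - overlap
        for i in range(overlap):
            # The weight ramps from the previous chunk towards the new one.
            w = (i + 1) / (overlap + 1)
            out[start + i] = (1.0 - w) * out[start + i] + w * chunk[i]
        # Append the part of the new chunk that lies beyond the overlap.
        out.extend(chunk[overlap:])
    return out
```

In a streaming setting, each converted chunk would be passed through a blend like this as it arrives, so that only the overlapping samples need to be held back before playback. Blending across the overlap reduces audible discontinuities at chunk boundaries, at the cost of a delay equal to the overlap length.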

Qixi Zheng, Yuxiang Zhao, Tianrui Wang, Wenxi Chen, Kele Xu, Yikang Li, Qinyuan Chen, Xipeng Qiu, Kai Yu, Xie Chen • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Voice Conversion | Seed-TTS zh (test) | WER | 1.99 | 9 |
| Voice Conversion | Seed-TTS en (test) | WER | 2.83 | 7 |
| Cross-lingual Voice Conversion | Seed-TTS-Eval Chinese-to-English | WER | 2.15 | 5 |
| Cross-lingual Voice Conversion | Seed-TTS English-to-Chinese (Eval) | WER | 2.67 | 4 |
| Zero-shot Voice Conversion | Seed-TTS-Eval zh (test) | SMOS | 3.89 | 3 |
| Zero-shot Voice Conversion | Seed-TTS-Eval en (test) | SMOS | 3.98 | 2 |
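For readers unfamiliar with the WER metric reported above, word error rate is conventionally defined as the number of substitutions, deletions, and insertions needed to turn a recognized transcript into the reference transcript, divided by the reference length. The sketch below is a generic token-level implementation; the function name `word_error_rate` is hypothetical, and the tokenization and ASR models used to produce the listed scores are not stated here and are not reproduced.

```python
from typing import Sequence


def word_error_rate(reference: Sequence[str], hypothesis: Sequence[str]) -> float:
    """Return (substitutions + deletions + insertions) / len(reference)."""
    if not reference:
        raise ValueError("reference must contain at least one token")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and hypothesis[:j]; it starts with the empty reference prefix.
    prev = list(range(len(hypothesis) + 1))
    for i, ref_tok in enumerate(reference, start=1):
        curr = [i] + [0] * len(hypothesis)
        for j, hyp_tok in enumerate(hypothesis, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(reference)
```

For example, comparing the reference `"the cat sat on the mat".split()` with the hypothesis `"the cat sit on mat".split()` involves one substitution and one deletion over six reference words, so the function returns 2/6. Multiplying by 100 gives a percentage.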