X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning

About

In this paper, we present X-Voice, a 0.4B multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified text representation. To remove the reliance on prompt text without resorting to complex preprocessing such as forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments as audio prompts. In Stage 2, we fine-tune on these audio pairs with the prompt text masked to derive X-Voice$_{\text{s2}}$, which enables zero-shot voice cloning without requiring transcripts of the audio prompts. Architecturally, we extend F5-TTS with a dual-level injection of language identifiers and with decoupled, scheduled Classifier-Free Guidance (CFG) to facilitate multilingual speech synthesis. Subjective and objective evaluation results demonstrate that X-Voice outperforms existing flow-matching-based multilingual systems such as LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
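The abstract names two generic building blocks, conditional flow-matching training and Classifier-Free Guidance. The sketch below illustrates the textbook form of both on plain Python lists. It is not the X-Voice implementation: the function names (`cfm_training_pair`, `cfm_loss`, `cfg_combine`), the straight-line noise-to-data path, and the single guidance scale are assumptions chosen for illustration, and the sketch does not attempt the decoupling or scheduling of guidance that the abstract mentions.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Build one flow-matching training example (illustrative, hypothetical helper).

    x1  : data sample as a list of floats (e.g. a flattened mel frame)
    t   : flow time in [0, 1]
    rng : random.Random instance used to draw the noise sample
    Returns the interpolated point x_t and the target velocity x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                 # Gaussian noise sample
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]   # straight-line path from x0 to x1
    target = [b - a for a, b in zip(x0, x1)]               # constant velocity along that path
    return xt, target


def cfm_loss(predicted, target):
    """Mean squared error between a predicted and a target velocity."""
    return sum((p - q) ** 2 for p, q in zip(predicted, target)) / len(target)


def cfg_combine(v_cond, v_uncond, scale):
    """Common single-scale CFG form: push the conditional prediction
    further away from the unconditional one by `scale`."""
    return [c + scale * (c - u) for c, u in zip(v_cond, v_uncond)]


if __name__ == "__main__":
    rng = random.Random(0)
    xt, target = cfm_training_pair([0.5, -1.0, 2.0], t=0.3, rng=rng)
    # A real model would predict the velocity from (xt, t, conditioning);
    # here the target stands in for a perfect prediction.
    print(cfm_loss(target, target))
    print(cfg_combine([1.0, 2.0], [0.5, 1.5], scale=2.0))
```

In a real system the velocity would come from a neural network conditioned on text, language identifiers, and the audio prompt; the lists here only stand in for those tensors.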

Rixi Xu, Qingyu Liu, Haitao Li, Yushen Chen, Zhikang Niu, Yunting Yang, Jian Zhao, Ke Li, Berrak Sisman, Qinyuan Cheng, Xipeng Qiu, Kai Yu, Xie Chen • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Text-to-Speech	X-Voice (test)	WER	2.29	186
Subjective Speech Quality Evaluation	X-Voice (test)	IMOS	4.64	156
Zero-shot Text-to-Speech	Seed-TTS en (test)	WER	1.3	25
Zero-shot Text-to-Speech	Seed-TTS zh (test)	WER	1.19	8
Cross-lingual Text-to-Speech	X-Voice (test)	Performance Score (en→it)	4.7	6
Text-to-Speech	LEMAS-TTS zh (test)	WER	1.38	3
Text-to-Speech	LEMAS-TTS en (test)	WER	0.98	3
Text-to-Speech	LEMAS-TTS de (test)	WER	7.12	3
Text-to-Speech	LEMAS-TTS es (test)	WER	2.7	3
Text-to-Speech	LEMAS-TTS fr (test)	WER	5.16	3

