X-Voice: Enabling Everyone to Speak 30 Languages via Zero-Shot Cross-Lingual Voice Cloning
About
We present X-Voice, a 0.4B-parameter multilingual zero-shot voice cloning model that clones arbitrary voices and enables everyone to speak 30 languages. X-Voice is trained on a 420K-hour multilingual corpus using the International Phonetic Alphabet (IPA) as a unified text representation. To eliminate the reliance on prompt text without complex preprocessing such as forced alignment, we design a two-stage training paradigm. In Stage 1, we establish X-Voice$_{\text{s1}}$ through standard conditional flow-matching training and use it to synthesize 10K hours of speaker-consistent segments that serve as audio prompts. In Stage 2, we fine-tune on these audio pairs with the prompt text masked to derive X-Voice$_{\text{s2}}$, which performs zero-shot voice cloning without requiring transcripts of the audio prompts. Architecturally, we extend F5-TTS with a dual-level injection of language identifiers and with decoupled, scheduled Classifier-Free Guidance to facilitate multilingual speech synthesis. Subjective and objective evaluation results show that X-Voice outperforms existing flow-matching-based multilingual systems such as LEMAS-TTS and achieves zero-shot cross-lingual cloning capabilities comparable to billion-scale models such as Qwen3-TTS. To facilitate research transparency and community advancement, we open-source all related resources.
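To make the "standard conditional flow-matching training" mentioned in the abstract more concrete, the following is a minimal sketch of how one training pair is typically built in this family of models. It assumes the common linear (optimal-transport) path between Gaussian noise and the clean features, which is the formulation used by F5-TTS-style systems; the abstract does not spell out X-Voice's exact objective, and the function names `cfm_training_pair` and `mse` are hypothetical, chosen only for illustration.

```python
import random


def cfm_training_pair(x1, t, rng):
    """Build one flow-matching training pair for a clean feature vector x1.

    Assumption: linear path x_t = (1 - t) * x0 + t * x1 with x0 ~ N(0, I),
    so the regression target is the constant velocity x1 - x0.
    """
    x0 = [rng.gauss(0.0, 1.0) for _ in x1]                  # noise sample
    xt = [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]    # point on the path
    target = [b - a for a, b in zip(x0, x1)]                # velocity to regress
    return xt, target


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                       # stand-in for one mel-spectrogram frame
xt, target = cfm_training_pair(x1, t=0.3, rng=rng)
loss_if_perfect = mse(target, target)       # a perfect velocity model gives zero loss
```

In a real system a neural network would predict the velocity from `xt`, `t`, and the conditioning (text, audio prompt, language identifier), and the loss would be averaged over masked frames and random `t`; those details are outside what the abstract states.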
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Speech | X-Voice (test) | WER | 2.29 | 186 |
| Subjective Speech Quality Evaluation | X-Voice (test) | IMOS | 4.64 | 156 |
| Zero-shot Text-to-Speech | Seed-TTS en (test) | WER | 1.3 | 25 |
| Zero-shot Text-to-Speech | Seed-TTS zh (test) | WER | 1.19 | 8 |
| Cross-lingual Text-to-Speech | X-Voice (test) | Performance Score (en→it) | 4.7 | 6 |
| Text-to-Speech | LEMAS-TTS zh (test) | WER | 1.38 | 3 |
| Text-to-Speech | LEMAS-TTS en (test) | WER | 0.98 | 3 |
| Text-to-Speech | LEMAS-TTS de (test) | WER | 7.12 | 3 |
| Text-to-Speech | LEMAS-TTS es (test) | WER | 2.7 | 3 |
| Text-to-Speech | LEMAS-TTS fr (test) | WER | 5.16 | 3 |
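
Most rows above report word error rate (WER). As a reminder of what that metric measures, here is a minimal sketch that computes WER as the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. It assumes whitespace tokenization and no text normalization; the ASR models, normalization, and tokenization actually used for these benchmarks are not stated on this page, and `word_error_rate` is a hypothetical helper name.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Dynamic-programming edit distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

The function returns a fraction; the table values are presumably percentages, so a result of 0.0229 would correspond to the 2.29 listed for the X-Voice test set.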