VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing
About
We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
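The abstract does not spell out the token reordering mechanism, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the span to be edited is moved to the end of the text stream and cut out of the speech stream, so that the model always generates the missing speech last. The `<MASK>` and `<SEP>` markers, the function name `reorder_for_editing`, and the span arguments (assumed to come from a forced aligner that time-aligns text and speech tokens) are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical special tokens; the abstract does not name the real ones.
MASK = "<MASK>"
SEP = "<SEP>"

Span = Tuple[int, int]  # half-open [start, end) indices


def reorder_for_editing(
    text: List[str],
    speech: List[str],
    text_span: Span,
    speech_span: Span,
) -> Tuple[List[str], List[str]]:
    """Return (context, target) for one editing example.

    The edited text span is moved after the surrounding text, and the
    time-aligned speech span is removed from the speech stream so it can
    be generated at the end of the sequence.
    """
    t0, t1 = text_span
    s0, s1 = speech_span
    # Text: prefix, placeholder, suffix, separator, then the edited span.
    text_part = text[:t0] + [MASK] + text[t1:] + [SEP] + text[t0:t1]
    # Speech: prefix, placeholder, suffix, separator; the span itself is the target.
    speech_part = speech[:s0] + [MASK] + speech[s1:] + [SEP]
    return text_part + speech_part, speech[s0:s1]


# Editing: replace the speech aligned with "quick brown".
text = ["the", "quick", "brown", "fox"]
speech = ["s0", "s1", "s2", "s3", "s4", "s5", "s6", "s7"]
context, target = reorder_for_editing(text, speech, (1, 3), (2, 6))

# Zero-shot TTS as a special case: the span sits at the very end, so the
# suffix is empty and there is no existing speech for the new text.
prompt_text, new_text = ["hello"], ["good", "morning"]
prompt_speech = ["p0", "p1"]
tts_context, tts_target = reorder_for_editing(
    prompt_text + new_text, prompt_speech, (1, 3), (2, 2)
)
```

Under this assumed layout, editing and zero-shot TTS differ only in where the span falls, which is one way both tasks could be treated as a single sequence generation problem.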
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Speech | Seed-TTS-Eval zh (test) | CER | 3.26 | 21 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3.06 | 14 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.05 | 14 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3.05 | 14 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.04 | 14 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.01 | 14 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3 | 14 |
| Multilingual Voice Cloning | CV3-Eval Multilingual Voice Cloning (hard-en) | WER | 28.64 | 6 |
| Speech Editing | Ming-Freeform-Audio-Edit Chinese Deletion | IMOS | 4.313 | 5 |
| Multilingual Voice Cloning | CV3-Eval Multilingual Voice Cloning (hard-zh) | CER (%) | 26.36 | 5 |
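
The CER and WER entries in the table are edit-distance error rates (lower is better). As a rough illustration of how such a rate is usually computed, here is a minimal sketch. It is not the scoring script of any benchmark listed above: those typically transcribe the generated audio with an ASR model and normalize the text first, and the helper names `edit_distance`, `error_rate`, `wer`, and `cer` are chosen here for illustration only.

```python
from typing import Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance: substitutions, insertions, and deletions cost 1."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence[str], hyp: Sequence[str]) -> float:
    """Edit distance divided by reference length, as a percentage."""
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over whitespace-separated tokens."""
    return error_rate(ref.split(), hyp.split())


def cer(ref: str, hyp: str) -> float:
    """Character error rate, ignoring spaces (common for Mandarin text)."""
    return error_rate(list(ref.replace(" ", "")), list(hyp.replace(" ", "")))


print(wer("the cat sat", "the cat sit"))  # one substitution out of three words
print(cer("你好世界", "你好世届"))          # one substitution out of four characters
```

Because the denominator is the reference length, a rate above 100% is possible when the hypothesis contains many insertions.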