
VoiceCraft-X: Unifying Multilingual, Voice-Cloning Speech Synthesis and Speech Editing

About

We introduce VoiceCraft-X, an autoregressive neural codec language model which unifies multilingual speech editing and zero-shot Text-to-Speech (TTS) synthesis across 11 languages: English, Mandarin, Korean, Japanese, Spanish, French, German, Dutch, Italian, Portuguese, and Polish. VoiceCraft-X utilizes the Qwen3 large language model for phoneme-free cross-lingual text processing and a novel token reordering mechanism with time-aligned text and speech tokens to handle both tasks as a single sequence generation problem. The model generates high-quality, natural-sounding speech, seamlessly creating new audio or editing existing recordings within one framework. VoiceCraft-X shows robust performance in diverse linguistic settings, even with limited per-language data, underscoring the power of unified autoregressive approaches for advancing complex, real-world multilingual speech applications. Audio samples are available at https://zhishengzheng.com/voicecraft-x/.
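The abstract does not spell out the reordering itself. As a rough illustration of how editing and zero-shot TTS can share one left-to-right generation problem, the sketch below assumes a hypothetical layout in which the span to be regenerated is moved to the end of the sequence and replaced in place by a marker. The `MASK` token and both function names are invented for this example; this is not VoiceCraft-X's actual implementation and it ignores the time alignment between text and speech tokens.

```python
from typing import List, Tuple

MASK = "<MASK>"  # hypothetical placeholder token, not taken from the paper


def reorder_for_infilling(
    tokens: List[str], start: int, end: int
) -> Tuple[List[str], List[str]]:
    """Split a sequence into (context, target) for left-to-right generation.

    The span tokens[start:end] is replaced by MASK in the context and
    returned separately as the target to be generated last.
    """
    if not 0 <= start <= end <= len(tokens):
        raise ValueError("span must lie inside the sequence")
    prefix, span, suffix = tokens[:start], tokens[start:end], tokens[end:]
    return prefix + [MASK] + suffix, span


def restore_order(context: List[str], generated: List[str]) -> List[str]:
    """Put generated tokens back where the MASK marker sits."""
    i = context.index(MASK)
    return context[:i] + generated + context[i + 1:]


if __name__ == "__main__":
    speech = ["s0", "s1", "s2", "s3", "s4"]

    # Editing: regenerate the middle span s1..s2.
    ctx, tgt = reorder_for_infilling(speech, 1, 3)
    print(ctx, tgt)

    # TTS as a special case: the span to generate runs to the end,
    # so the context is just the prompt followed by the marker.
    ctx_tts, tgt_tts = reorder_for_infilling(speech, 2, 5)
    print(ctx_tts, tgt_tts)
```

Under this assumed layout, synthesis is simply an edit whose suffix is empty, which is one way to read the claim that both tasks reduce to a single sequence generation problem.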

Zhisheng Zheng, Puyuan Peng, Anuj Diwan, Cong Phuoc Huynh, Xiaohang Sun, Zhu Liu, Vimal Bhat, David Harwath • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Text-to-Speech | Seed-TTS-Eval zh (test) | CER | 3.26 | 21
Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3.06 | 14
Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.05 | 14
Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3.05 | 14
Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.04 | 14
Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.01 | 14
Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3 | 14
Multilingual Voice Cloning | CV3-Eval Multilingual Voice Cloning (hard-en) | WER | 28.64 | 6
Speech Editing | Ming-Freeform-Audio-Edit Chinese Deletion | IMOS | 4.313 | 5
Multilingual Voice Cloning | CV3-Eval Multilingual Voice Cloning (hard-zh) | CER (%) | 26.36 | 5
Showing 10 of 23 rows
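
Several rows above report WER or CER. Both are edit-distance ratios: the number of substitutions, insertions, and deletions needed to turn the hypothesis into the reference, divided by the reference length, over words for WER and over characters for CER. The sketch below is a minimal reference computation under simplifying assumptions (whitespace tokenization, no text normalization, spaces dropped for CER); the function names are illustrative, and the benchmarks listed here may normalize text differently and typically score an ASR transcript of the generated audio.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance between two sequences (unit costs)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref_units: Sequence, hyp_units: Sequence) -> float:
    """Edit distance as a percentage of the reference length."""
    if len(ref_units) == 0:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref_units, hyp_units) / len(ref_units)


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate in percent, splitting on whitespace."""
    return error_rate(reference.split(), hypothesis.split())


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate in percent, ignoring spaces."""
    return error_rate(list(reference.replace(" ", "")),
                      list(hypothesis.replace(" ", "")))


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on mat"))
    print(cer("你好世界", "你好世"))
```

Because the denominator is the reference length, both rates can exceed 100% when the hypothesis contains many insertions.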
