CosyEdit: Unlocking End-to-End Speech Editing Capability from Zero-Shot Text-to-Speech Models

About

Automatic speech editing aims to modify spoken content based on textual instructions, yet traditional cascade systems suffer from complex preprocessing pipelines and a reliance on explicit external temporal alignment. Addressing these limitations, we propose CosyEdit, an end-to-end speech editing model adapted from CosyVoice through task-specific fine-tuning and an optimized inference procedure, which internalizes speech-text alignment while ensuring high consistency between the speech before and after editing. By fine-tuning on only 250 hours of supervised data from our curated GigaEdit dataset, our 400M-parameter model achieves reliable speech editing performance. Experiments on the RealEdit benchmark indicate that CosyEdit not only outperforms several billion-parameter language model baselines but also matches the performance of state-of-the-art cascade approaches. These results demonstrate that, with task-specific fine-tuning and inference optimization, robust and efficient speech editing capabilities can be unlocked from a zero-shot TTS model, yielding a novel and cost-effective end-to-end solution for high-quality speech editing.

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yaxin Han, Mengying Feng, Yong Qin • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Text-to-Speech | Seed-TTS-Eval zh (test) | CER 1.76 | 21 |
| Speech Editing | RealEdit | WER 4.5 | 15 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS 3.1 | 14 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (full) | DNSMOS 3.13 | 14 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS 3.09 | 14 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS 3.1 | 14 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS 3.11 | 14 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS 3.11 | 14 |
| Multilingual Voice Cloning | CV3-Eval Multilingual Voice Cloning (hard-en) | WER 13.93 | 6 |
| Speech Editing | Ming-Freeform-Audio-Edit English Insertion | IMOS 4.543 | 6 |
Showing 10 of 21 rows
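
The WER and CER figures in the table are edit-distance error rates between a reference transcript and a transcript of the generated speech. The sketch below shows the standard definition only. It is not the scoring script of RealEdit, Seed-TTS-Eval, or CV3-Eval: those benchmarks' text normalization and ASR models are not described here, and the function names `edit_distance` and `error_rate` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return WER (unit="word") or CER (unit="char") as a percentage.

    Assumption: whitespace tokenization for words, and spaces dropped
    for characters. Real benchmarks usually normalize text first.
    """
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    elif unit == "char":
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    else:
        raise ValueError("unit must be 'word' or 'char'")
    if not ref:
        raise ValueError("reference must not be empty")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted word out of six reference words.
    print(error_rate("the cat sat on the mat", "the cat sat on a mat"))
```

Under this definition a lower value is better, and the rate can exceed 100 when the hypothesis contains many inserted tokens.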
