CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS

About

Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
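As a rough illustration of the group-relative step that gives GRPO its name, the sketch below normalizes the rewards of several candidate outputs sampled for the same prompt against that group's own mean and standard deviation. The abstract does not describe CosyEdit2's reward design or training details, so the reward values, the function name `group_relative_advantages`, and the use of the population standard deviation are assumptions made only for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for candidates sampled from the same prompt.
    eps: small constant that avoids division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for four sampled edits of one utterance.
rewards = [0.2, 0.5, 0.9, 0.4]
advantages = group_relative_advantages(rewards)
```

Candidates scoring above the group mean receive positive advantages and those below receive negative ones, so no separately trained value model is needed to provide a baseline.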

Junyang Chen, Yuhang Jia, Hui Wang, Jiaming Zhou, Yongchang Gan, Yong Qin • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Speech | Seed-TTS-Eval zh (test) | CER 1.16 | 21 |
| Speech Editing | RealEdit | WER 4.31 | 15 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (full) | DNSMOS 3.05 | 14 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS 3.04 | 14 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS 3.01 | 14 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS 3 | 14 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS 3.02 | 14 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS 3.03 | 14 |
| Multilingual Voice Cloning | CV3-Eval Multilingual Voice Cloning (hard-en) | WER 5.93 | 6 |
| Speech Editing | Ming-Freeform-Audio-Edit English Insertion | IMOS 4.773 | 6 |
Showing 10 of 25 rows
