CosyEdit2: Speech-Editing-Oriented Reinforcement Learning Unlocks Better Zero-Shot TTS
About
Speech editing and zero-shot Text-to-Speech (TTS) share a similar generative foundation conditioned on speech prompts, yet speech editing demands far stricter local acoustic consistency with surrounding unedited content. While prior work has shown that Supervised Fine-Tuning (SFT) enables TTS models to acquire functional editing capability, this approach remains fundamentally bottlenecked by imperfect paired editing data and coarse-grained optimization signals. To address these limitations, we propose CosyEdit2, a speech editing model built on a two-stage post-training framework that progresses from supervised editing initialization to editing-oriented Group Relative Policy Optimization (GRPO) over target-speech-free data. Extensive experiments demonstrate that CosyEdit2 not only substantially advances speech editing performance, but also unlocks better zero-shot TTS capability, revealing a deeper mutual relationship between the two tasks. Audio samples are available at https://cjy1018.github.io/CosyEdit2.
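As a rough illustration of the group-relative signal that gives GRPO its name, the sketch below normalizes the rewards of several candidates sampled for the same prompt against their own group statistics. It is a minimal sketch, not CosyEdit2's implementation: the function name `group_relative_advantages`, the use of the population standard deviation, the `eps` value, and the example reward values are all assumptions made for illustration, and the abstract does not specify the actual editing-oriented reward.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against its own sampling group.

    rewards: scalar rewards for G candidates sampled from one prompt
             (hypothetical values; the real reward design is not given here).
    eps:     small constant to avoid division by zero when all rewards match.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four hypothetical candidates for one editing prompt.
advantages = group_relative_advantages([0.2, 0.5, 0.9, 0.4])
```

Candidates scoring above the group mean receive positive advantages and those below receive negative ones, so no separate value model is needed to produce a baseline.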
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Text-to-Speech | Seed-TTS-Eval zh (test) | CER | 1.16 | 21 |
| Speech Editing | RealEdit | WER | 4.31 | 15 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3.05 | 14 |
| Speech Editing (Substitution) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.04 | 14 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.01 | 14 |
| Speech Editing (Deletion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3 | 14 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (basic) | DNSMOS | 3.02 | 14 |
| Speech Editing (Insertion) | Ming-Freeform-Audio-Edit English (full) | DNSMOS | 3.03 | 14 |
| Multilingual Voice Cloning | CV3-Eval Multilingual Voice Cloning (hard-en) | WER | 5.93 | 6 |
| Speech Editing | Ming-Freeform-Audio-Edit English Insertion | IMOS | 4.773 | 6 |
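
For readers unfamiliar with the WER and CER figures in the table, the following is a minimal sketch of how such error rates are conventionally defined: the Levenshtein distance between reference and hypothesis token sequences, divided by the reference length and expressed as a percentage. The helper names `edit_distance` and `error_rate` are hypothetical, and the sketch assumes plain whitespace tokenization; the benchmarks' own scoring pipelines (ASR transcription, text normalization) are not described here and may differ.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(reference, hypothesis, unit="word"):
    """Return WER (unit="word") or CER (unit="char") as a percentage."""
    if unit == "word":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        # Character level: spaces are ignored (an assumption of this sketch).
        ref = list(reference.replace(" ", ""))
        hyp = list(hypothesis.replace(" ", ""))
    if not ref:
        raise ValueError("reference must contain at least one token")
    return 100.0 * edit_distance(ref, hyp) / len(ref)


# One substituted word out of five reference words.
wer = error_rate("the cat sat on mat", "the cat sit on mat")
```

Lower is better for both metrics, whereas the DNSMOS and IMOS rows report opinion-score-style measures where higher is better.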