SSR-Speech: Towards Stable, Safe and Robust Zero-shot Text-based Speech Editing and Synthesis

About

In this paper, we introduce SSR-Speech, a neural codec autoregressive model designed for stable, safe, and robust zero-shot text-based speech editing and text-to-speech synthesis. SSR-Speech is built on a Transformer decoder and incorporates classifier-free guidance to enhance the stability of the generation process. A watermark Encodec is proposed to embed frame-level watermarks into the edited regions of the speech, so that the edited parts can be detected. In addition, the waveform reconstruction leverages the original unedited speech segments, providing superior recovery compared to the Encodec model. Our approach achieves state-of-the-art performance on the RealEdit speech editing task and the LibriTTS text-to-speech task, surpassing previous methods. Furthermore, SSR-Speech excels in multi-span speech editing and demonstrates remarkable robustness to background sounds. The source code and demos are released.
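As a rough illustration of the classifier-free guidance mentioned above, the sketch below blends conditional and unconditional logits for a single autoregressive decoding step. It uses the commonly cited form `uncond + scale * (cond - uncond)`; the abstract does not state the exact formulation or guidance scale used by SSR-Speech, so the function names, the toy 4-token vocabulary, and the scale values here are all assumptions for illustration only.

```python
import math


def cfg_logits(cond_logits, uncond_logits, guidance_scale):
    """Blend logits with a common classifier-free guidance form.

    Assumption: guided = uncond + scale * (cond - uncond). This is a generic
    formulation, not necessarily the one used in SSR-Speech.
    """
    return [u + guidance_scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits over a toy 4-token codec vocabulary for one decoding step.
cond = [2.0, 1.0, 0.5, -1.0]    # model output given the text/prompt condition
uncond = [1.0, 1.0, 1.0, 1.0]   # model output with the condition dropped

probs_plain = softmax(cfg_logits(cond, uncond, 1.0))   # scale 1.0 recovers the conditional logits
probs_guided = softmax(cfg_logits(cond, uncond, 2.0))  # scale > 1 sharpens toward the condition

print(probs_plain)
print(probs_guided)
```

With a scale of 1.0 the guided logits equal the conditional logits, and larger scales push probability mass toward tokens that the condition favors relative to the unconditional prediction.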

Helin Wang, Meng Yu, Jiarui Hai, Chen Chen, Yuchen Hu, Rilin Chen, Najim Dehak, Dong Yu • 2024

Related benchmarks

Task	Dataset	Result
Text-to-Speech	Seed-TTS-Eval zh (test)	CER 20.51	21
Speech Editing	RealEdit	WER 5.05	15
Speech Editing (Insertion)	Ming-Freeform-Audio-Edit English (basic)	DNSMOS 3.06	14
Speech Editing (Substitution)	Ming-Freeform-Audio-Edit English (basic)	DNSMOS 3.08	14
Speech Editing (Substitution)	Ming-Freeform-Audio-Edit English (full)	DNSMOS 3.08	14
Speech Editing (Insertion)	Ming-Freeform-Audio-Edit English (full)	DNSMOS 3.06	14
Speech Editing (Deletion)	Ming-Freeform-Audio-Edit English (basic)	DNSMOS 3.03	14
Speech Editing (Deletion)	Ming-Freeform-Audio-Edit English (full)	DNSMOS 3.02	14
Speech Editing	Ming-Freeform-Audio-Edit English Insertion	IMOS 4.55	6
Speech Editing	Ming-Freeform-Audio-Edit English Deletion	IMOS 4.597	6

Showing 10 of 18 rows
