GROOT: Generating Robust Watermark for Diffusion-Model-Based Audio Synthesis
About
As generative models such as diffusion models advance, distinguishing synthesized audio from natural audio becomes increasingly difficult. Deepfake detection offers one way to address this challenge, yet as a defensive measure it unintentionally drives further refinement of generative models. Watermarking is a proactive and sustainable alternative that regulates the creation and dissemination of synthesized content in advance. This paper proposes Groot, a pioneering generative robust audio watermarking method, and presents a paradigm for proactively supervising synthesized audio and its source diffusion models. In this paradigm, watermark generation and audio synthesis occur simultaneously, facilitated by parameter-fixed diffusion models equipped with a dedicated encoder. The watermark embedded in the audio can subsequently be retrieved by a lightweight decoder. Experimental results show that Groot performs strongly, particularly in robustness, where it surpasses leading state-of-the-art methods. Beyond its resilience against individual post-processing attacks, Groot remains robust under compound attacks, maintaining an average watermark extraction accuracy of around 95%.
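The abstract reports robustness as an average watermark extraction accuracy. As a minimal sketch, assuming the watermark is a fixed-length bit string and that accuracy is the fraction of bits the decoder recovers correctly (a common definition, not confirmed by the text above), the metric could be computed as follows. The function names `bit_accuracy` and `average_accuracy` are hypothetical and are not taken from the Groot code.

```python
from typing import Iterable, Sequence, Tuple


def bit_accuracy(embedded: Sequence[int], extracted: Sequence[int]) -> float:
    """Fraction of watermark bits that match (assumed definition of accuracy)."""
    if len(embedded) != len(extracted):
        raise ValueError("watermarks must have the same length")
    if len(embedded) == 0:
        raise ValueError("watermark must not be empty")
    matches = sum(1 for a, b in zip(embedded, extracted) if a == b)
    return matches / len(embedded)


def average_accuracy(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Mean bit accuracy over (embedded, extracted) pairs, e.g. one pair per attacked clip."""
    scores = [bit_accuracy(embedded, extracted) for embedded, extracted in pairs]
    if not scores:
        raise ValueError("at least one watermark pair is required")
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy data: one clip decoded perfectly, one clip with a single flipped bit.
    clips = [
        ([1, 0, 1, 1], [1, 0, 1, 1]),
        ([0, 1, 1, 0], [0, 1, 0, 0]),
    ]
    print(average_accuracy(clips))
```

Averaging per-clip scores rather than pooling all bits keeps each attacked clip equally weighted; with equal-length watermarks the two approaches give the same number.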
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Watermarking | LJSpeech | PESQ | 3.9574 | 88 |
| Speech Watermarking | LJSpeech 2017 | STOI | 0.9589 | 17 |
| Speech Watermarking | LJSpeech (in-distribution) | Gaussian Noise (5 dB) Score | 0.9913 | 13 |
| Speech Watermarking | LJSpeech (in-distribution) | MP3 (16 kbps) Acc | 0.745 | 13 |
| Audio Watermarking | LibriTTS | PESQ | 3.2867 | 8 |
| Audio Watermarking | LibriSpeech | PESQ | 3.2416 | 8 |
| Generative Speech Watermarking | LJSpeech (test) | Inference Time (ms) | 153.3 | 7 |