Proactive Detection of Voice Cloning with Localized Watermarking
About
In the rapidly evolving field of speech generative models, there is a pressing need to ensure audio authenticity against the risks of voice cloning. We present AudioSeal, the first audio watermarking technique designed specifically for localized detection of AI-generated speech. AudioSeal employs a generator/detector architecture trained jointly with a localization loss, which enables localized watermark detection up to the sample level, and a novel perceptual loss inspired by auditory masking, which improves imperceptibility. AudioSeal achieves state-of-the-art performance in terms of robustness to real-life audio manipulations and imperceptibility, based on automatic and human evaluation metrics. Additionally, AudioSeal is designed with a fast, single-pass detector that significantly surpasses existing models in speed, achieving detection up to two orders of magnitude faster and making it well suited to large-scale and real-time applications.
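To make "localized detection up to the sample level" concrete, the sketch below shows one way per-sample detector scores could be turned into watermarked time spans. It is an illustration under assumptions, not AudioSeal's implementation: the function name `localize_segments`, the 0.5 threshold, the `min_len` parameter, and the example scores are all hypothetical, and the input is assumed to be a list of per-sample watermark probabilities from some detector.

```python
from typing import List, Sequence, Tuple


def localize_segments(
    probs: Sequence[float],
    threshold: float = 0.5,   # hypothetical decision threshold
    min_len: int = 1,         # hypothetical minimum run length, in samples
) -> List[Tuple[int, int]]:
    """Return half-open (start, end) sample ranges whose scores are >= threshold.

    `probs` is assumed to hold one watermark probability per audio sample.
    Runs shorter than `min_len` samples are discarded.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index where the current above-threshold run began
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i
        elif start is not None:
            # The run ended at sample i (exclusive); keep it if long enough.
            if i - start >= min_len:
                segments.append((start, i))
            start = None
    # Close a run that extends to the end of the signal.
    if start is not None and len(probs) - start >= min_len:
        segments.append((start, len(probs)))
    return segments


# Made-up scores for eight samples, for illustration only.
example_probs = [0.1, 0.9, 0.8, 0.2, 0.7, 0.95, 0.99, 0.1]
example_segments = localize_segments(example_probs)
```

Dividing the returned sample indices by the sampling rate gives the spans in seconds; raising `min_len` suppresses isolated high-scoring samples.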
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Watermarking | LJSpeech | PESQ | 1.7863 | 88 |
| Speech Watermarking | LJSpeech 2017 | STOI | 0.9971 | 17 |
| Speech Watermarking | LJSpeech (in-distribution) | Gaussian Noise (5 dB) Score | 0.5951 | 13 |
| Speech Watermarking | LJSpeech (in-distribution) | MP3 (16 kbps) Acc | 0.6042 | 13 |
| Audio Watermarking | LibriTTS | PESQ | 1.6842 | 8 |
| Audio Watermarking | LibriSpeech | PESQ | 1.6523 | 8 |