Hidden in Plain Tokens: Simply Robust, Gradient-Free Watermark for Synthetic Audio

About

As policy catches up with the capabilities of generative AI, watermarking is central to content provenance efforts. Inference-time watermarks for autoregressive models are unfit for continuous modalities due to discretization inconsistencies. Existing methods overcome this by finetuning the modality tokenizers, nullifying the watermark's training-free advantage. In this work, motivated by the vocabulary redundancy of discretization, we propose an elegant solution for powerful and robust watermarking of synthetic audio. We theoretically analyze the impact of token errors on watermark detection, and effectively mitigate them using a reduced vocabulary obtained via community detection. Thorough experiments showcase that our gradient-free method can boost detectability by several orders of magnitude, while also achieving built-in robustness to audio modifications. Broadly, we discover a new state-of-the-art for token-level watermarks in multimedia, which simply arises from the nature of discrete representation learning.
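The intuition behind detecting on a reduced vocabulary can be illustrated with a small sketch. It is not the authors' implementation: it assumes a generic green-list watermark with a one-sided z-test, a hypothetical 32-token vocabulary, and a hand-written `cluster_of` map standing in for the clusters that the paper obtains via community detection. All names (`is_green`, `z_score`, `generate`, `GAMMA`, `demo-key`) are illustrative. The point is only that a token error which stays inside a cluster leaves the cluster sequence, and therefore the detection statistic, unchanged.

```python
import hashlib
import math
import random

GAMMA = 0.5          # assumed fraction of ids on the green list
VOCAB, GROUP = 32, 4  # hypothetical vocabulary size and cluster size

# Hypothetical reduced vocabulary: token id -> cluster id.
# (The paper derives clusters via community detection; this is a stand-in.)
cluster_of = {t: t // GROUP for t in range(VOCAB)}


def is_green(prev_id, cur_id, key="demo-key"):
    """Pseudo-randomly mark cur_id as green, seeded by the key and prev_id."""
    digest = hashlib.sha256(f"{key}:{prev_id}:{cur_id}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def z_score(ids, key="demo-key"):
    """One-sided z-score of the green count over consecutive id pairs."""
    n = len(ids) - 1
    greens = sum(is_green(p, c, key) for p, c in zip(ids, ids[1:]))
    return (greens - GAMMA * n) / math.sqrt(GAMMA * (1 - GAMMA) * n)


def generate(n_steps, seed=0):
    """Toy generator: pick a green cluster at each step, then any token in it."""
    rng = random.Random(seed)
    clusters, tokens = [0], [0]
    for _ in range(n_steps):
        candidates = list(range(VOCAB // GROUP))
        rng.shuffle(candidates)
        green = [c for c in candidates if is_green(clusters[-1], c)]
        c = green[0] if green else candidates[0]  # fall back if none is green
        clusters.append(c)
        tokens.append(c * GROUP + rng.randrange(GROUP))
    return tokens


tokens = generate(200)
# Simulate re-tokenization errors: every token is swapped for a different
# token from the same cluster.
noisy = [cluster_of[t] * GROUP + (t + 1) % GROUP for t in tokens]

z_clean = z_score([cluster_of[t] for t in tokens])
z_noisy = z_score([cluster_of[t] for t in noisy])
print(z_clean, z_noisy)
```

In this toy setting every raw token differs between `tokens` and `noisy`, yet both map to the same cluster sequence, so the two z-scores coincide. Errors that cross cluster boundaries would still reduce the green count.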

Georgios Milis, Yubin Qin, Yihan Wu, Heng Huang • 2026

Related benchmarks

Task	Dataset	Result
Audio Quality Evaluation	Moshi conversational audio prompts	VGGish Score 0.133	13
Audio Quality Evaluation	Moshi LibriSpeech prompts	VGGish Score 1.921	13
Watermark Detectability	Conversational prompts	Probability (p) 5.466	6
Audio Generation Quality	MusicCaps MusicGen 32kHz (val)	FAD (VGGish) 1.256	4
Text-to-Speech	CosyVoice3 generated audio	FAD (VGGish) 0.1942	3
Text-to-Speech	Spark-TTS generated audio	FAD (VGGish) 0.3506	3
