VocBulwark: Towards Practical Generative Speech Watermarking via Additional-Parameter Injection

About

Generated speech achieves human-level naturalness but escalates security risks of misuse. However, existing watermarking methods fail to reconcile fidelity with robustness, as they rely either on simple superposition in the noise space or on intrusive alterations to model weights. To bridge this gap, we propose VocBulwark, an additional-parameter injection framework that freezes generative model parameters to preserve perceptual quality. Specifically, we design a Temporal Adapter to deeply entangle watermarks with acoustic attributes, synergizing with a Coarse-to-Fine Gated Extractor to resist advanced attacks. Furthermore, we develop an Accuracy-Guided Optimization Curriculum that dynamically orchestrates gradient flow to resolve the optimization conflict between fidelity and robustness. Comprehensive experiments demonstrate that VocBulwark achieves high-capacity and high-fidelity watermarking, offering robust defense against complex practical scenarios, with resilience to Codec regenerations and variable-length manipulations.
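The abstract's core idea, keeping the generator's weights frozen and training only newly added parameters, can be illustrated with a toy example. The sketch below is not the authors' Temporal Adapter: `FrozenLayer`, `MessageAdapter`, the additive offset, and the zero initialization are all hypothetical choices made for illustration, with a plain linear map standing in for a vocoder layer.

```python
class FrozenLayer:
    """Stand-in for one pretrained generator layer; its weights are never updated."""

    def __init__(self, weights):
        # Store an immutable copy so nothing can modify the base weights.
        self._weights = tuple(tuple(row) for row in weights)

    def forward(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self._weights]


class MessageAdapter:
    """Hypothetical additional parameters: map watermark bits to an additive offset."""

    def __init__(self, out_dim, n_bits):
        # Zero init (an assumption): the adapted layer starts identical to the base.
        self.proj = [[0.0] * n_bits for _ in range(out_dim)]

    def forward(self, hidden, bits):
        signs = [2 * b - 1 for b in bits]  # map {0, 1} to {-1, +1}
        return [h + sum(p * s for p, s in zip(row, signs))
                for h, row in zip(hidden, self.proj)]

    def sgd_step(self, grad_out, bits, lr=0.1):
        # d(out_i)/d(proj_ij) = signs_j, so the gradient is grad_out_i * signs_j.
        signs = [2 * b - 1 for b in bits]
        for i, g in enumerate(grad_out):
            for j, s in enumerate(signs):
                self.proj[i][j] -= lr * g * s


base = FrozenLayer([[0.5, -0.2], [0.1, 0.3]])
adapter = MessageAdapter(out_dim=2, n_bits=3)
x, bits = [1.0, 2.0], [1, 0, 1]

clean = base.forward(x)                      # output of the untouched generator layer
before = adapter.forward(clean, bits)        # zero-init adapter leaves it unchanged
adapter.sgd_step(grad_out=[1.0, -1.0], bits=bits)
marked = adapter.forward(base.forward(x), bits)  # only the adapter has changed
```

Because the update touches `adapter.proj` only, the base layer's output for the same input is identical before and after training, which is the property that lets this style of design leave the unwatermarked generation path intact.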

Weizhi Liu, Yue Li, Zhaoxia Yin • 2026

Related benchmarks

Task	Dataset	Metric	Result
Speech Watermarking	LJSpeech 2017	STOI	0.9795	17
Speech Watermarking	LJSpeech (in-distribution)	Gaussian Noise (5 dB) Score	0.9986	13
Speech Watermarking	LJSpeech (in-distribution)	MP3 (16 kbps) Acc	0.9984	13
Generative Speech Watermarking	LibriTTS OOD (test)	STOI	0.9789	8
Generative Speech Watermarking	AiShell3 OOD (test)	STOI	0.969	8
Generative Speech Watermarking	LJSpeech (test)	Inference Time (ms)	13.48	7
Speech Watermarking	LibriTTS (out-of-distribution)	Accuracy (GN 5 dB)	96.01	4
Speech Watermarking	AiShell3 (out-of-distribution)	Robustness (Gaussian Noise 5 dB)	98.68	4
Speech Watermarking	LibriTTS (OOD)	MP3 16 kbps Accuracy	0.9891	4
Speech Watermarking	AiShell3 (OOD)	MP3 (16 kbps) Accuracy	0.9879	4
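
Several rows above report extraction accuracy under Gaussian noise at 5 dB. As a rough sketch of how such a number could be measured, the snippet below adds white Gaussian noise at a target signal-to-noise ratio and computes bit accuracy as the fraction of matching bits. It assumes that "5 dB" denotes SNR; the function names and the sine-wave stand-in for speech are illustrative and do not come from the paper's evaluation code.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Return signal plus white Gaussian noise rescaled to the requested SNR in dB."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(s * s for s in signal) / len(signal)
    noise_power = sum(n * n for n in noise) / len(noise)
    # SNR_dB = 10 * log10(P_signal / P_noise), so P_noise = P_signal / 10^(SNR_dB / 10).
    target_noise_power = signal_power / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(signal, noise)]


def measure_snr_db(clean, noisy):
    """Measure the SNR in dB of a noisy signal against its clean reference."""
    signal_energy = sum(c * c for c in clean)
    error_energy = sum((n - c) ** 2 for c, n in zip(clean, noisy))
    return 10 * math.log10(signal_energy / error_energy)


def bit_accuracy(sent, decoded):
    """Fraction of watermark bits recovered correctly (1.0 means a perfect match)."""
    if len(sent) != len(decoded):
        raise ValueError("bit sequences must have the same length")
    return sum(a == b for a, b in zip(sent, decoded)) / len(sent)


# One second of a 440 Hz tone at 16 kHz, used here in place of real speech.
tone = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noisy = add_gaussian_noise(tone, snr_db=5.0)
measured = measure_snr_db(tone, noisy)

# Toy bit sequences: one of four bits is flipped.
acc = bit_accuracy([1, 0, 1, 1], [1, 0, 0, 1])
```

Note that `bit_accuracy` returns a fraction in [0, 1], whereas the table mixes fractional values (such as 0.9986) with percentage-style values (such as 96.01), exactly as listed.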
