
RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

About

Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and no access to watermarking internals or detectors. Despite this weak supervision, it enables a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-watermarked texts, far surpassing the 6% achieved by baselines trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and underscoring the urgent need for more robust schemes.
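The abstract's "local capacity bottleneck" caps how much probability mass a KL-bounded local edit can move. The paper's exact statement is not reproduced here, but a standard inequality of this flavor is Pinsker's: TV(p, q) ≤ sqrt(KL(p‖q)/2), which bounds the total variation (the mass reallocated) by the KL budget. A minimal Python sketch under that assumption; the distributions below are illustrative, not from the paper:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """TV distance: half the L1 distance, i.e. the probability mass moved."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def pinsker_cap(kl_budget):
    """Pinsker's inequality: TV <= sqrt(KL / 2)."""
    return (kl_budget / 2.0) ** 0.5

# Illustrative next-token distributions before and after a local edit.
p = np.array([0.50, 0.30, 0.15, 0.05])  # original model distribution
q = np.array([0.40, 0.35, 0.18, 0.07])  # locally edited distribution

kl = kl_divergence(q, p)
tv = total_variation(p, q)
print(f"KL = {kl:.4f}, mass moved (TV) = {tv:.4f}, "
      f"Pinsker cap = {pinsker_cap(kl):.4f}")  # TV never exceeds the cap
```

The smaller the KL budget (here, the semantic-fidelity constraint), the less mass a spoofing edit can redistribute toward watermark-favored tokens, which is the intuition the bottleneck formalizes.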

Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang • 2026

Related benchmarks

Task                   | Dataset                     | Metric | Result | Rank
LLM Watermark Spoofing | EWD                         | SSR    | 56.5   | 20
LLM Watermark Spoofing | SWEET                       | SSR    | 54.5   | 20
LLM Watermark Spoofing | Unigram                     | SSR    | 54.8   | 20
LLM Watermark Spoofing | PF                          | SSR    | 62.0   | 20
LLM Watermark Spoofing | PMark                       | SSR    | 36.3   | 20
Watermark Spoofing     | Unigram Watermarking Scheme | SSR    | 54.8   | 20
Watermark Spoofing     | PF watermarking scheme      | SSR    | 62.0   | 20
Watermark Spoofing     | PMark                       | SSR    | 36.3   | 20
Watermark Spoofing     | EWD (test)                  | SSR    | 56.5   | 20
Watermark Spoofing     | SWEET (test)                | SSR    | 54.5   | 20

(Showing 10 of 14 rows.)
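All results report SSR (spoof success rate): the fraction of attacker-generated, unwatermarked texts that the target detector nonetheless flags as watermarked. A minimal sketch of how such a rate could be computed; the detector callback is a hypothetical stand-in, not an interface from the paper (RLSpoofer assumes no detector access during training, only for evaluation):

```python
from typing import Callable, List

def spoof_success_rate(spoofed_texts: List[str],
                       is_watermarked: Callable[[str], bool]) -> float:
    """Percentage of spoofed texts the watermark detector accepts as watermarked.

    `is_watermarked` stands in for a black-box detector decision
    (hypothetical interface for illustration).
    """
    if not spoofed_texts:
        return 0.0
    hits = sum(is_watermarked(t) for t in spoofed_texts)
    return 100.0 * hits / len(spoofed_texts)

# Example with a stub detector; replace with a real watermark detector.
stub_detector = lambda text: len(text) % 2 == 0  # placeholder only
print(spoof_success_rate(["ab", "abc", "abcd"], stub_detector))  # 66.67
```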
