
RLSpoofer: A Lightweight Evaluator for LLM Watermark Spoofing Resilience

About

Large language model (LLM) watermarking has emerged as a promising approach for detecting and attributing AI-generated text, yet its robustness to black-box spoofing remains insufficiently evaluated. Existing evaluation methods often demand extensive datasets and white-box access to algorithmic internals, limiting their practical applicability. In this paper, we study watermark resilience against spoofing from a distributional perspective. We first establish a local capacity bottleneck, which theoretically characterizes the probability mass that can be reallocated under KL-bounded local updates while preserving semantic fidelity. Building on this, we propose RLSpoofer, a reinforcement learning-based black-box spoofing attack that requires only 100 human-watermarked paraphrase training pairs and no access to watermarking internals or detectors. Despite this weak supervision, it enables a 4B model to achieve a 62.0% spoof success rate with minimal semantic shift on PF-watermarked texts, far surpassing the 6% achieved by baselines trained on up to 10,000 samples. Our findings expose the fragile spoofing resistance of current LLM watermarking paradigms, providing a lightweight evaluation framework and underscoring the urgent need for more robust schemes.
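The abstract's "local capacity bottleneck" caps how much probability mass a KL-bounded local edit can move. The paper's exact statement is not reproduced here, but a standard inequality of this flavor is Pinsker's: TV(p, q) ≤ sqrt(KL(p‖q)/2), which bounds the total variation (the mass reallocated) by the KL budget. A minimal Python sketch under that assumption; the distributions below are illustrative, not from the paper:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(p || q) for discrete distributions, in nats."""
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def total_variation(p, q):
    """TV distance: half the L1 distance, i.e. the probability mass moved."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

def pinsker_cap(kl_budget):
    """Pinsker's inequality: TV <= sqrt(KL / 2)."""
    return (kl_budget / 2.0) ** 0.5

# Illustrative next-token distributions before and after a local edit.
p = np.array([0.50, 0.30, 0.15, 0.05])  # original model distribution
q = np.array([0.40, 0.35, 0.18, 0.07])  # locally edited distribution

kl = kl_divergence(q, p)
tv = total_variation(p, q)
print(f"KL = {kl:.4f}, mass moved (TV) = {tv:.4f}, "
      f"Pinsker cap = {pinsker_cap(kl):.4f}")  # TV never exceeds the cap
```

The smaller the KL budget (here, the semantic-fidelity constraint), the less mass a spoofing edit can redistribute toward watermark-favored tokens, which is the intuition the bottleneck formalizes.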

Hanbo Huang, Xuan Gong, Yiran Zhang, Hao Zheng, Shiyu Liang • 2026

Related benchmarks

Task                   | Dataset                     | Metric | Result | Rank
LLM Watermark Spoofing | EWD                         | SSR    | 56.5   | 20
LLM Watermark Spoofing | SWEET                       | SSR    | 54.5   | 20
LLM Watermark Spoofing | Unigram                     | SSR    | 54.8   | 20
LLM Watermark Spoofing | PF                          | SSR    | 62.0   | 20
LLM Watermark Spoofing | PMark                       | SSR    | 36.3   | 20
Watermark Spoofing     | Unigram Watermarking Scheme | SSR    | 54.8   | 20
Watermark Spoofing     | PF watermarking scheme      | SSR    | 62.0   | 20
Watermark Spoofing     | PMark                       | SSR    | 36.3   | 20
Watermark Spoofing     | EWD (test)                  | SSR    | 56.5   | 20
Watermark Spoofing     | SWEET (test)                | SSR    | 54.5   | 20

(Showing 10 of 14 rows.)
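All results report SSR (spoof success rate): the fraction of attacker-generated, unwatermarked texts that the target detector nonetheless flags as watermarked. A minimal sketch of how such a rate could be computed; the detector callback is a hypothetical stand-in, not an interface from the paper (RLSpoofer assumes no detector access during training, only for evaluation):

```python
from typing import Callable, List

def spoof_success_rate(spoofed_texts: List[str],
                       is_watermarked: Callable[[str], bool]) -> float:
    """Percentage of spoofed texts the watermark detector accepts as watermarked.

    `is_watermarked` stands in for a black-box detector decision
    (hypothetical interface for illustration).
    """
    if not spoofed_texts:
        return 0.0
    hits = sum(is_watermarked(t) for t in spoofed_texts)
    return 100.0 * hits / len(spoofed_texts)

# Example with a stub detector; replace with a real watermark detector.
stub_detector = lambda text: len(text) % 2 == 0  # placeholder only
print(spoof_success_rate(["ab", "abc", "abcd"], stub_detector))  # 66.67
```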
