RLCracker: Evaluating the Worst-Case Vulnerability of LLM Watermarks with Adaptive RL Attacks

About

Large language model (LLM) watermarking has shown promise in detecting AI-generated content and mitigating misuse, with prior work claiming robustness against paraphrasing and text editing. In this paper, we argue that existing evaluations are not sufficiently adversarial, which obscures critical vulnerabilities and overstates security. To address this, we introduce the adaptive robustness radius, a formal metric that quantifies the worst-case resilience of watermarks against adaptive adversaries. By lifting the paraphrase space into a KL-divergence ball, we approximate this radius and show theoretically that optimizing the attack context and model parameters can significantly reduce the approximate radius, leaving watermarks highly vulnerable to paraphrase attacks. Leveraging this insight, we propose RLCracker, a reinforcement learning (RL)-based adaptive attack that erases watermark signals using only a limited number of watermarked examples and limited access to the detector. Despite this weak supervision, it enables a 3B model to achieve 98.5% removal success with minimal semantic shift on 1,500-token Unigram-watermarked texts after training on only 100 short samples. This far exceeds the 6.75% achieved by GPT-4o and generalizes across five model sizes and ten watermarking schemes. Our code is available at https://github.com/OTT0-OTO/RLCracker.
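
To make "erasing the watermark signal" concrete, the sketch below shows a toy green-list detector of the kind used by context-free schemes such as Unigram. It is not the paper's code and does not implement the RLCracker attack. The keyed-hash green list, the key `demo-key`, the threshold of 4.0, and all function names are assumptions chosen for illustration. The only standard part is the z-score, which compares the observed number of green tokens against the count expected by chance.

```python
import hashlib
import math


def is_green(token: str, gamma: float = 0.5, key: str = "demo-key") -> bool:
    """Toy context-free green-list test (hypothetical, for illustration only).

    The token is hashed together with a secret key, and the hash is mapped
    to [0, 1). Tokens falling below gamma are treated as "green".
    """
    digest = hashlib.sha256((key + "\x00" + token).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < gamma


def z_score(tokens, gamma: float = 0.5, key: str = "demo-key") -> float:
    """One-proportion z-score of the green-token count.

    Under the null hypothesis (unwatermarked text) each token is green with
    probability gamma, so the count has mean gamma * n and variance
    gamma * (1 - gamma) * n.
    """
    n = len(tokens)
    if n == 0:
        raise ValueError("cannot score an empty token sequence")
    green = sum(is_green(t, gamma, key) for t in tokens)
    return (green - gamma * n) / math.sqrt(gamma * (1.0 - gamma) * n)


def is_detected(tokens, threshold: float = 4.0, gamma: float = 0.5,
                key: str = "demo-key") -> bool:
    """Flag the text as watermarked when the z-score exceeds the threshold.

    The threshold value is an assumption; real detectors tune it to a target
    false-positive rate.
    """
    return z_score(tokens, gamma, key) > threshold
```

Under these assumptions, a paraphrase attack succeeds against this toy detector when it rewrites a flagged text so that `is_detected` returns `False` while the meaning is preserved. Checking that the meaning is preserved is outside the scope of this sketch.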

Hanbo Huang, Yiran Zhang, Hao Zheng, Xuan Gong, Yihan Li, Lin Liu, Zhuotao Liu, Shiyu Liang • 2025

Related benchmarks

Task | Dataset | Result | Rank
--- | --- | --- | ---
Watermark Removal | Watermarked Text 500 tokens | EWD 94.8 | 30
Watermark Removal | Watermarked Text 1500 tokens | EWD 73 | 30
Watermark Removal Attack | KGW_self 500 token (test) | ESR 89.8 | 6
LLM Watermark Evasion | Unigram (1500 tokens) | ESR 98.5 | 4
Watermark Removal | SemStamp 500-token texts | ESR 63.3 | 3
Watermark Removal | k-SemStamp 500-token texts | ESR 75.5 | 3
Watermark Removal | KGW gamma=0.5 delta=8.0 500-token texts | ESR 78.5 | 3
Watermark Removal | KGW gamma=0.75, delta=2.0 500-token texts | ESR 90.5 | 3
Watermark Removal | RLWatermark GAUSSMark 500-token texts | ESR 87.5 | 3
Watermark Removal | Unigram 500 tokens (test) | ESR 78.5 | 3
Showing 10 of 15 rows
