LLM Watermark Evasion via Bias Inversion
About
Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests. Our code is available at https://github.com/ml-postech/LLM-Watermark-Evasion-via-Bias-Inversion.
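The two ideas in the abstract can be illustrated with a small Python sketch. It is not the authors' implementation. It assumes a KGW-style detector that applies a one-sided z-test to the green-token count, and, for simplicity, that each token is green independently with the same probability `p`, whereas the abstract speaks of an average conditional probability. The function names, the green-list fraction `gamma=0.25`, and the threshold `z_thresh=4.0` are illustrative assumptions. The last helper shows only the logit-bias step; how the suppression set is chosen from token surprisal is not modeled here.

```python
import math


def detection_threshold(T, gamma=0.25, z_thresh=4.0):
    """Green-token count that a length-T text must exceed to be flagged."""
    return gamma * T + z_thresh * math.sqrt(gamma * (1 - gamma) * T)


def detection_probability(T, p, gamma=0.25, z_thresh=4.0):
    """Exact P(z > z_thresh) if each token is green independently w.p. p."""
    k_min = math.floor(detection_threshold(T, gamma, z_thresh)) + 1
    return sum(
        math.comb(T, k) * p**k * (1 - p) ** (T - k) for k in range(k_min, T + 1)
    )


def hoeffding_bound(T, p, gamma=0.25, z_thresh=4.0):
    """Hoeffding upper bound exp(-2 * T * gap^2) on the detection probability.

    gap is the distance between the per-token threshold rate and p;
    the bound is trivial (1.0) when p is at or above that rate.
    """
    gap = detection_threshold(T, gamma, z_thresh) / T - p
    if gap <= 0:
        return 1.0
    return math.exp(-2 * T * gap**2)


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def apply_negative_bias(logits, suppress_ids, delta):
    """Subtract delta from the logits of tokens in the suppression set."""
    return [x - delta if i in suppress_ids else x for i, x in enumerate(logits)]


if __name__ == "__main__":
    # Lowering p from 0.50 toward gamma shrinks the detection probability fast.
    for p in (0.50, 0.40, 0.35, 0.30, 0.25):
        exact = detection_probability(200, p)
        bound = hoeffding_bound(200, p)
        print(f"p={p:.2f}  exact={exact:.3e}  hoeffding<={bound:.3e}")
```

Under these assumptions, the gap between the threshold rate and `p` enters the bound squared and multiplied by the text length `T`, which is why a small reduction in the green-token probability is enough to defeat detection on longer texts.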
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Watermark Evasion | LLM Watermarking Algorithms (KGW, Unigram, UPV, EWD, DIP, SIR, EXP) | KGW Evasion Score | 99.8 | 11 |
| Watermark Evasion | KGW | Attack Success Rate | 99.8 | 6 |
| Watermark Evasion | Unigram | Attack Success Rate | 99.4 | 6 |
| Watermark Evasion | UPV | Attack Success Rate (ASR) | 99.8 | 6 |
| Watermark Evasion | EWD | Attack Success Rate | 100 | 6 |
| Watermark Evasion | SIR | Attack Success Rate | 99.6 | 6 |
| Watermark Evasion | EXP | Attack Success Rate | 99.8 | 6 |
| Watermark Evasion | DIP | Attack Success Rate | 100 | 6 |
| Watermark Evasion | DBpedia | KGW | 98 | 5 |
| Watermarking Robustness | Dolly CW | KGW | 100 | 5 |