
LLM Watermark Evasion via Bias Inversion

About

Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests. Our code is available at https://github.com/ml-postech/LLM-Watermark-Evasion-via-Bias-Inversion.
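The two ideas in the abstract (a detector that counts green tokens, and a negative logit bias on a suppression set chosen by surprisal) can be illustrated with a small Python sketch. This is not the authors' implementation; their code is in the linked repository. The z-score below is the standard one-proportion test used by KGW-style detectors. The function names, the bias magnitude, the surprisal threshold, and the rule "suppress tokens whose surprisal exceeds a threshold" are all assumptions made for illustration.

```python
import math


def kgw_z_score(green_count, total, gamma=0.25):
    """z-score of a KGW-style detector; gamma is the green-list fraction."""
    return (green_count - gamma * total) / math.sqrt(total * gamma * (1 - gamma))


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]


def surprisal(prob):
    """Self-information of a token, in nats."""
    return -math.log(prob)


def select_suppression_set(token_ids, token_probs, threshold):
    """Hypothetical proxy rule: keep token ids whose surprisal exceeds threshold.

    token_probs are the probabilities some reference model assigns to each
    token of the watermarked text (an assumption of this sketch).
    """
    return {t for t, p in zip(token_ids, token_probs) if surprisal(p) > threshold}


def apply_negative_bias(logits, suppression_set, bias=4.0):
    """Subtract `bias` from the logit of every token id in the suppression set."""
    return [x - bias if i in suppression_set else x for i, x in enumerate(logits)]


# A detector sees fewer green tokens, so its z-score drops.
z_watermarked = kgw_z_score(green_count=100, total=200)
z_rewritten = kgw_z_score(green_count=50, total=200)

# Biasing a rewriter's logits away from a suppressed token lowers its probability.
logits = [2.0, 1.0, 0.5, 0.0]
p_before = softmax(logits)[0]
p_after = softmax(apply_negative_bias(logits, {0}))[0]
```

Here `z_rewritten` falls to the no-watermark baseline because the green fraction equals `gamma`, which is the intuition behind the abstract's claim that a small drop in green-token probability sharply reduces detection.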

Jeongyeon Hwang, Sangdon Park, Jungseul Ok • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Watermark Evasion | LLM Watermarking Algorithms KGW, Unigram, UPV, EWD, DIP, SIR, EXP | KGW Evasion Score: 99.8 | 11 |
| Watermark Evasion | KGW | Attack Success Rate: 99.8 | 6 |
| Watermark Evasion | Unigram | Attack Success Rate: 99.4 | 6 |
| Watermark Evasion | UPV | Attack Success Rate (ASR): 99.8 | 6 |
| Watermark Evasion | EWD | Attack Success Rate: 100 | 6 |
| Watermark Evasion | SIR | Attack Success Rate: 99.6 | 6 |
| Watermark Evasion | EXP | Attack Success Rate: 99.8 | 6 |
| Watermark Evasion | DIP | Attack Success Rate: 100 | 6 |
| Watermark Evasion | DBpedia | KGW: 98 | 5 |
| Watermarking Robustness | Dolly CW | KGW: 100 | 5 |

Showing 10 of 12 rows
