LLM Watermark Evasion via Bias Inversion

About

Watermarking offers a promising solution for detecting LLM-generated content, yet its robustness under realistic query-free (black-box) evasion remains an open challenge. Existing query-free attacks often achieve limited success or severely distort semantic meaning. We bridge this gap by theoretically analyzing rewriting-based evasion, demonstrating that reducing the average conditional probability of sampling green tokens by a small margin causes the detection probability to decay exponentially. Guided by this insight, we propose the Bias-Inversion Rewriting Attack (BIRA), a practical query-free method that applies a negative logit bias to a proxy suppression set identified via token surprisal. Empirically, BIRA achieves state-of-the-art evasion rates (>99%) across diverse watermarking schemes while preserving semantic fidelity substantially better than prior baselines. Our findings reveal a fundamental vulnerability in current watermarking methods and highlight the need for rigorous stress tests. Our code is available at https://github.com/ml-postech/LLM-Watermark-Evasion-via-Bias-Inversion.
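The two ideas in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, and it rests on assumptions the abstract does not confirm: the suppression set is taken to be the highest-surprisal tokens of the watermarked text, the bias is a fixed value `delta`, and the detector is a KGW-style z-test with green-list fraction `gamma`. All function names and parameters are hypothetical. `detection_bound` uses a standard Hoeffding-style inequality, not the paper's theorem, to show how the detection probability bound shrinks exponentially as the green-token probability drops.

```python
import math


def select_suppression_set(token_ids, token_probs, top_fraction=0.5):
    """Pick the highest-surprisal tokens as a proxy suppression set.

    Assumption: surprisal is -log(prob), so the lowest-probability tokens
    are selected. The paper's actual selection rule may differ.
    """
    # Ascending probability is the same as descending surprisal.
    scored = sorted(zip(token_ids, token_probs), key=lambda pair: pair[1])
    keep = max(1, int(len(scored) * top_fraction))
    return {tok for tok, _ in scored[:keep]}


def apply_negative_bias(logits, suppress_ids, delta=4.0):
    """Subtract `delta` from the logit of every token id in `suppress_ids`.

    `logits` is indexed by token id; `delta` is an illustrative value.
    """
    return [x - delta if i in suppress_ids else x for i, x in enumerate(logits)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def detection_bound(num_tokens, gamma, z_threshold, green_prob):
    """Hoeffding-style upper bound on the probability of detection.

    Assumes a KGW-style test that flags text when
    (G - gamma*T) / sqrt(gamma*(1-gamma)*T) >= z_threshold, and that each
    token is green with conditional probability at most `green_prob`.
    """
    # Fraction of green tokens needed to cross the z-score threshold.
    needed = gamma + z_threshold * math.sqrt(gamma * (1 - gamma) / num_tokens)
    gap = needed - green_prob
    if gap <= 0:
        return 1.0  # the bound is vacuous in this regime
    return math.exp(-2 * num_tokens * gap ** 2)
```

In a rewriting attack of this kind, a paraphrasing model would generate the new text while `apply_negative_bias` is applied to its logits at each decoding step. Comparing `detection_bound(200, 0.5, 4.0, 0.5)` with `detection_bound(200, 0.5, 4.0, 0.45)` shows that a small drop in green-token probability lowers the bound by several orders of magnitude.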

Jeongyeon Hwang, Sangdon Park, Jungseul Ok • 2025

Related benchmarks

Task	Dataset	Metric	Result
Watermark Evasion	LLM Watermarking Algorithms KGW, Unigram, UPV, EWD, DIP, SIR, EXP	KGW Evasion Score	99.8	11
Watermark Evasion	KGW	Attack Success Rate	99.8	6
Watermark Evasion	Unigram	Attack Success Rate	99.4	6
Watermark Evasion	UPV	Attack Success Rate (ASR)	99.8	6
Watermark Evasion	EWD	Attack Success Rate	100	6
Watermark Evasion	SIR	Attack Success Rate	99.6	6
Watermark Evasion	EXP	Attack Success Rate	99.8	6
Watermark Evasion	DIP	Attack Success Rate	100	6
Watermark Evasion	DBpedia	KGW	98	5
Watermarking Robustness	Dolly CW	KGW	100	5

Showing 10 of 12 rows
