Bypassing LLM Watermarks with Color-Aware Substitutions

About

Watermarking approaches have been proposed to identify whether circulated text is human-written or generated by a large language model (LLM). The state-of-the-art watermarking strategy of Kirchenbauer et al. (2023a) biases the LLM to generate specific ("green") tokens. However, determining the robustness of this watermarking method is an open problem. Existing attack methods fail to evade detection for longer text segments. We overcome this limitation and propose Self Color Testing-based Substitution (SCTS), the first "color-aware" attack. SCTS obtains color information by strategically prompting the watermarked LLM and comparing output token frequencies. It uses this information to determine token colors, and substitutes green tokens with non-green ones. In our experiments, SCTS successfully evades watermark detection using fewer edits than related work. Additionally, we show both theoretically and empirically that SCTS can remove the watermark for arbitrarily long watermarked text.
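The idea in the abstract can be illustrated with a toy simulation. The sketch below is not the authors' implementation: the hash-based color rule, the logit bonus `delta`, the 0.7/0.3 frequency thresholds, the synthetic vocabulary, and every function name are assumptions chosen for illustration. In particular, `query_model` is a hypothetical stand-in for prompting the watermarked LLM to choose between two tokens; the attack itself never reads the color rule directly and infers colors only from how often each token is returned.

```python
import hashlib
import math
import random

GAMMA = 0.5  # assumed fraction of the vocabulary that is green at each step


def is_green(prev_token: str, token: str, key: str = "demo-key") -> bool:
    """Toy color rule (assumption): hash (key, previous token, token) into [0, 1)."""
    digest = hashlib.sha256(f"{key}|{prev_token}|{token}".encode()).digest()
    return int.from_bytes(digest[:8], "big") / 2**64 < GAMMA


def z_score(tokens: list[str]) -> float:
    """One-proportion z-test on the number of green tokens (first token is unscored)."""
    scored = len(tokens) - 1
    green = sum(is_green(p, t) for p, t in zip(tokens, tokens[1:]))
    return (green - GAMMA * scored) / math.sqrt(scored * GAMMA * (1 - GAMMA))


def make_watermarked(vocab: list[str], length: int, rng: random.Random) -> list[str]:
    """Toy 'watermarked' text: every token after the first is drawn from the green set."""
    tokens = [rng.choice(vocab)]
    while len(tokens) < length:
        greens = [w for w in vocab if is_green(tokens[-1], w)]
        tokens.append(rng.choice(greens))
    return tokens


def query_model(prev: str, a: str, b: str, rng: random.Random, delta: float = 2.0) -> str:
    """Hypothetical stand-in for asking the watermarked LLM to output a or b at random.
    Green tokens receive a logit bonus of delta, which skews the output frequencies."""
    wa = math.exp(delta if is_green(prev, a) else 0.0)
    wb = math.exp(delta if is_green(prev, b) else 0.0)
    return a if rng.random() < wa / (wa + wb) else b


def sc_substitute(tokens, vocab, rng, trials=200, max_candidates=8):
    """Replace tokens that look green with candidates that look red, using only
    the output frequencies observed from query_model."""
    out = list(tokens)
    for i in range(1, len(out)):
        prev, cur = out[i - 1], out[i]  # prev may already have been substituted
        candidates = [w for w in vocab if w != cur]
        for cand in rng.sample(candidates, max_candidates):
            wins = sum(query_model(prev, cur, cand, rng) == cur for _ in range(trials))
            share = wins / trials
            if share > 0.7:  # cur is favored: cur looks green, cand looks red
                out[i] = cand
                break
            if share < 0.3:  # cand is favored: cur already looks red, keep it
                break
            # otherwise both tokens appear to share a color; try another candidate
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    vocab = [f"w{i}" for i in range(50)]  # synthetic vocabulary, not real synonyms
    text = make_watermarked(vocab, 41, rng)
    attacked = sc_substitute(text, vocab, rng)
    print("z before:", round(z_score(text), 2))
    print("z after: ", round(z_score(attacked), 2))
```

Because the color of each token depends on the token before it in this toy rule, the loop works left to right and always tests against the already-substituted predecessor. A real attack would draw candidates from context-appropriate substitutes rather than a random vocabulary, and would have to budget the number of model queries.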

Qilong Wu, Varun Chandrasekaran • 2024

Related benchmarks

Task	Dataset	Result
Watermark Detection	Vicuna-7b 16k 50 samples v1.5	AUROC (Overall) 0.9852	94
Watermark Detection	Llama-2-7b-chat-hf 10 samples UMD watermarking (test)	AUROC (t=0) 1	64
Watermark Attack Robustness	Vicuna 7b 16k v1.5 (test)	ASR 62	30
Watermark Attack Success Rate	Llama-2-7b-chat-hf UMD watermarking (10 samples)	ASR 100	15
Watermark Evasion	vicuna-7b 50 samples, UMD watermarking v1.5-16k (test)	ASR (0 Unattacked) 58	15
