On the Reliability of Watermarks for Large Language Models
About
As LLMs become commonplace, machine-generated text has the potential to flood the internet with spam, social media bots, and valueless content. Watermarking is a simple and effective strategy for mitigating such harms by enabling the detection and documentation of LLM-generated text. Yet a crucial question remains: how reliable is watermarking in realistic settings? There, watermarked text may be modified to suit a user's needs, or entirely rewritten to avoid detection. We study the robustness of watermarked text after it is rewritten by humans, paraphrased by a non-watermarked LLM, or mixed into a longer hand-written document. We find that watermarks remain detectable even after human and machine paraphrasing. While these attacks dilute the strength of the watermark, paraphrases are statistically likely to leak n-grams or even longer fragments of the original text, resulting in high-confidence detections when enough tokens are observed. For example, after strong human paraphrasing the watermark is detectable after observing 800 tokens on average at a false positive rate of 1e-5. We also consider a range of new detection schemes that are sensitive to short spans of watermarked text embedded inside a large document, and we compare the robustness of watermarking to other kinds of detectors.
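To make the link between dilution, token count, and a fixed false positive rate concrete, here is a minimal sketch. It is not taken from the abstract, which does not specify the detection statistic. It assumes a green-list style watermark scored with a one-sided, one-proportion z-test. The green-list fraction `gamma` and all function names are hypothetical choices for illustration.

```python
import math
from statistics import NormalDist


def z_threshold(fpr: float) -> float:
    """One-sided z threshold whose tail probability under the null equals fpr."""
    return NormalDist().inv_cdf(1.0 - fpr)


def z_score(green_count: float, total: int, gamma: float = 0.25) -> float:
    """z statistic for observing green_count green tokens out of total.

    Assumption: without a watermark, each token is green with probability gamma.
    """
    expected = gamma * total
    std = math.sqrt(total * gamma * (1.0 - gamma))
    return (green_count - expected) / std


def min_tokens_for_detection(green_rate: float, fpr: float, gamma: float = 0.25) -> int:
    """Smallest token count at which a text with the given green fraction crosses the threshold.

    Solves (green_rate - gamma) * sqrt(T) / sqrt(gamma * (1 - gamma)) >= z for T.
    """
    if green_rate <= gamma:
        raise ValueError("green_rate must exceed gamma for detection to be possible")
    z = z_threshold(fpr)
    return math.ceil(z * z * gamma * (1.0 - gamma) / (green_rate - gamma) ** 2)


if __name__ == "__main__":
    fpr = 1e-5
    print(f"z threshold at FPR={fpr}: {z_threshold(fpr):.3f}")
    # A paraphrase attack lowers the green fraction toward gamma,
    # so more tokens are needed before the test becomes confident.
    for rate in (0.50, 0.35, 0.30):
        print(f"green fraction {rate:.2f}: {min_tokens_for_detection(rate, fpr)} tokens")
```

Under these assumptions, the required number of tokens grows with the inverse square of the gap between the observed green fraction and `gamma`. That is one way to read the claim that diluted watermarks remain detectable once enough tokens are observed.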
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| NLP Watermarking | WaterBench & RepoBench-P (test) | KoLA Score: 1.1 | 24 |
| Watermark Detection | C4 subset | -- | 24 |
| Watermark Detection | LLaMA-2 Token Replacement Attack, epsilon=0.05 (1,000 generated sequences) | TPR@FPR=0.1%: 72.38 | 7 |
| Watermark Detection | LLaMA-2 Token Replacement Attack, epsilon=0.1 (1,000 generated sequences) | TPR@FPR=0.1%: 59.4 | 7 |
| Watermark Detection | LLaMA-2 Token Replacement Attack, epsilon=0.2 (1,000 generated sequences) | TPR@FPR=0.1%: 31.07 | 7 |
| Watermark Detection | WikiText, IMDB, AG News, Yelp Polarity mixed human-written corpus (test) | FPR (%): 2 | 5 |
| Watermark Detection | Alpaca instruction-following 52K | TPR: 16 | 5 |
| Watermark Localization | C4 and Arxiv | Latency (s): 33.9 | 4 |