Provable Robust Watermarking for AI-Generated Text

About

We study the problem of watermarking text generated by large language models (LLMs), one of the most promising approaches for addressing the safety challenges of LLM usage. In this paper, we propose a rigorous theoretical framework to quantify the effectiveness and robustness of LLM watermarks. We propose a robust and high-quality watermark method, Unigram-Watermark, by extending an existing approach with a simplified fixed grouping strategy. We prove that our watermark method enjoys guaranteed generation quality, correctness in watermark detection, and robustness against text editing and paraphrasing. Experiments on three different LLMs and two datasets verify that Unigram-Watermark achieves superior detection accuracy and comparable generation quality as measured by perplexity, thus promoting the responsible use of LLMs. Code is available at https://github.com/XuandongZhao/Unigram-Watermark.

Xuandong Zhao, Prabhanjan Ananth, Lei Li, Yu-Xiang Wang • 2023
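
The "fixed grouping strategy" mentioned in the abstract can be illustrated with a small sketch. It assumes the usual green-list setup: a secret key selects one fixed subset of the vocabulary (the "green" tokens), generation adds a constant boost to the logits of green tokens at every step, and detection counts green tokens and computes a z-score against the count expected without a watermark. The function names, the parameter names `gamma` (green fraction), `delta` (logit boost) and `key`, and the default values are assumptions made for this illustration; this is not the code from the authors' repository.

```python
import math
import random


def make_green_list(vocab_size, gamma=0.5, key=0):
    """Choose one fixed 'green' subset of token ids from a secret key.

    The same subset is reused at every generation step (assumed setup).
    """
    rng = random.Random(key)
    ids = list(range(vocab_size))
    rng.shuffle(ids)
    return set(ids[: int(gamma * vocab_size)])


def watermark_logits(logits, green, delta=2.0):
    """Return a copy of the logits with delta added to every green token."""
    return [x + delta if i in green else x for i, x in enumerate(logits)]


def detect_z_score(token_ids, green, gamma=0.5):
    """z-score of the green-token count against the expectation gamma * n."""
    n = len(token_ids)
    if n == 0:
        raise ValueError("cannot score an empty token sequence")
    hits = sum(1 for t in token_ids if t in green)
    return (hits - gamma * n) / math.sqrt(n * gamma * (1 - gamma))
```

A detector built this way would flag a text when the z-score exceeds a chosen threshold. Because the green list in this sketch does not depend on neighbouring tokens, replacing one token changes the green count by at most one, which gives some intuition for robustness to edits.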

Related benchmarks

Task	Dataset	Result
Language Modeling	C4	Perplexity: 10.21	1565
Mathematical Reasoning	GSM8K (test)	Accuracy: 12.21	816
Watermark Robustness Analysis	Gemma-2-2B	Post-attack TPR: 100	49
Watermarking Attack Robustness	Gemma 9B v2 (test)	TPR: 99	49
Question Answering	TruthfulQA	Truthful*Inf Score: 71.96	42
Watermark Detection	C4	TPR @ 1% FPR: 99.8	36
Language Modeling	LLaMA-2 13B	Perplexity (PPL): 9.275	32
Watermark Detection	C4 OPT-6.7B	ROC-AUC: 100	26
NLP Watermarking	WaterBench & RepoBench-P (test)	KoLA Score: 1.8	24
Watermark Detection	C4 Gemma-7B	ROC-AUC: 0.998	18

Showing 10 of 39 rows
