A Semantic Invariant Robust Watermark for Large Language Models

About

Watermark algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting text generated by LLMs. Such algorithms typically add extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness. This is because the watermark logits for a token are determined by a fixed number of preceding tokens: a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we use a separate embedding LLM to generate a semantic embedding of all preceding tokens, and our trained watermark model then transforms this embedding into the watermark logits. Subsequent analyses and experiments demonstrate the attack robustness of our method in semantically invariant settings, namely synonym substitution and text paraphrasing. Finally, we also show that our watermark possesses adequate security robustness. Our code and data are available at https://github.com/THU-BPM/Robust_Watermark. Our algorithm can also be accessed through MarkLLM (Pan et al., 2024; https://github.com/THU-BPM/MarkLLM).
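The pipeline described in the abstract (embed the whole prefix, map the embedding to watermark logits, add them to the LLM's logits) can be illustrated with a toy sketch. Everything here is an assumption for illustration, not the authors' implementation: `toy_embed` is a hypothetical stand-in for the embedding LLM, the fixed random matrix `W` stands in for the trained watermark model, and `VOCAB_SIZE`, `EMBED_DIM`, and `delta` are made-up values.

```python
import hashlib
import math
import random

VOCAB_SIZE = 16  # toy vocabulary size (assumption)
EMBED_DIM = 8    # toy embedding dimension (assumption)


def toy_embed(prefix_tokens):
    """Hypothetical stand-in for the embedding LLM.

    Averages one deterministic pseudo-random vector per token, so the
    result depends on the whole prefix rather than a fixed-size window.
    """
    vec = [0.0] * EMBED_DIM
    for tok in prefix_tokens:
        # Derive a stable seed from the token text (hash() is salted per run).
        seed = int.from_bytes(hashlib.sha256(tok.encode()).digest()[:8], "big")
        rng = random.Random(seed)
        for i in range(EMBED_DIM):
            vec[i] += rng.gauss(0.0, 1.0)
    n = max(len(prefix_tokens), 1)
    return [v / n for v in vec]


# Hypothetical stand-in for the trained watermark model:
# a fixed random linear map followed by tanh.
_rng = random.Random(0)
W = [[_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)] for _ in range(VOCAB_SIZE)]


def watermark_logits(prefix_tokens):
    """Map the prefix embedding to one bounded watermark logit per vocab entry."""
    emb = toy_embed(prefix_tokens)
    return [math.tanh(sum(w * e for w, e in zip(row, emb))) for row in W]


def add_watermark(llm_logits, prefix_tokens, delta=2.0):
    """Add the scaled watermark logits to the LLM's logits for the next token."""
    wm = watermark_logits(prefix_tokens)
    return [l + delta * w for l, w in zip(llm_logits, wm)]
```

In this toy version the watermark logits are a function of the entire prefix, which is the property the abstract contrasts with schemes keyed on a fixed number of preceding tokens. A real system would replace `toy_embed` with an actual sentence-embedding model and `W` with a trained network.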

Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, Lijie Wen • 2023

Related benchmarks

Task	Dataset	Metric	Result	
Watermark Detection	BookSum	TP @ FP=1%	77.4	154
Watermark Detection	C4	TPR @ FPR=1%	0.998	95
Watermarking	Natural Questions (NQ) (test)	AUROC	99.5	45
Sentence-Level Watermarking	C4	AUROC	99.8	40
Watermark Detection	C4	TPR @ 1% FPR	96.4	36
Language Modeling	LLaMA-2 13B	Perplexity (PPL)	8.566	32
Watermark Removal	Watermarked Text 500 tokens	EWD	71	30
Watermark Removal	Watermarked Text 1500 tokens	EWD	10.3	30
Watermark Detection	C4 OPT-6.7B	ROC-AUC	99.5	26
Watermarking Detection	BookSum (test)	Detection Rate (No Attack)	100	24

Showing 10 of 40 rows
