
A Semantic Invariant Robust Watermark for Large Language Models

About

Watermarking algorithms for large language models (LLMs) have achieved extremely high accuracy in detecting LLM-generated text. Such algorithms typically add extra watermark logits to the LLM's logits at each generation step. However, prior algorithms face a trade-off between attack robustness and security robustness, because the watermark logits for a token are determined by a fixed number of preceding tokens: a small number leads to low security robustness, while a large number results in insufficient attack robustness. In this work, we propose a semantic invariant watermarking method for LLMs that provides both attack robustness and security robustness. The watermark logits in our work are determined by the semantics of all preceding tokens. Specifically, we use a separate embedding LLM to generate a semantic embedding of all preceding tokens, and our trained watermark model then transforms this embedding into the watermark logits. Subsequent analyses and experiments demonstrate the attack robustness of our method in semantically invariant settings: synonym substitution and text paraphrasing. Finally, we also show that our watermark possesses adequate security robustness. Our code and data are available at https://github.com/THU-BPM/Robust_Watermark. Additionally, our algorithm can also be accessed through MarkLLM (https://github.com/THU-BPM/MarkLLM).
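The pipeline described in the abstract (embed all preceding tokens, map the embedding to watermark logits, add them to the LLM's logits) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: `toy_embed` is a hypothetical bag-of-tokens stand-in for the embedding LLM, `PROJ` is a fixed random projection standing in for the trained watermark model, and `VOCAB_SIZE`, `EMBED_DIM`, and `DELTA` are arbitrary values. The detection score shown, a mean of per-token watermark logits, is likewise a simplification.

```python
import math
import random

VOCAB_SIZE = 16   # toy vocabulary size (assumed)
EMBED_DIM = 8     # toy embedding dimension (assumed)
DELTA = 2.0       # watermark strength (assumed)


def toy_embed(prefix_tokens):
    """Hypothetical stand-in for the embedding LLM.

    Sums a fixed pseudo-random vector per token and normalizes the result,
    so the embedding depends on which tokens precede, not on their order.
    """
    vec = [0.0] * EMBED_DIM
    for tok in prefix_tokens:
        rng = random.Random(tok)  # same token always yields the same vector
        for i in range(EMBED_DIM):
            vec[i] += rng.gauss(0.0, 1.0)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0  # empty prefix: avoid /0
    return [v / norm for v in vec]


# Fixed random projection standing in for the trained watermark model.
_rng = random.Random(0)
PROJ = [[_rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
        for _ in range(VOCAB_SIZE)]


def watermark_logits(prefix_tokens):
    """Map the prefix embedding to one watermark logit per vocabulary token."""
    emb = toy_embed(prefix_tokens)
    return [math.tanh(sum(w * e for w, e in zip(row, emb))) for row in PROJ]


def watermarked_logits(llm_logits, prefix_tokens, delta=DELTA):
    """Add the scaled watermark logits to the LLM's own logits."""
    wm = watermark_logits(prefix_tokens)
    return [l + delta * w for l, w in zip(llm_logits, wm)]


def detection_score(tokens):
    """Mean watermark logit of each token given all tokens before it.

    Requires at least two tokens; higher values suggest watermarked text.
    """
    if len(tokens) < 2:
        raise ValueError("need at least two tokens to score")
    scores = [watermark_logits(tokens[:t])[tokens[t]]
              for t in range(1, len(tokens))]
    return sum(scores) / len(scores)
```

In this sketch, a generator would call `watermarked_logits` at each step before sampling, and a detector would compare `detection_score` against a threshold. Because the toy embedding ignores token order, reordering the prefix leaves the watermark logits essentially unchanged, which loosely mirrors the semantic-invariance idea; a real embedding model would be needed for robustness to synonym substitution or paraphrasing.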

Aiwei Liu, Leyi Pan, Xuming Hu, Shiao Meng, Lijie Wen • 2023

Related benchmarks

Task                          Dataset                     Metric                          Result  Rank
Watermark Detection           C4                          TPR @ 1% FPR                    96.4    36
Language Modeling             LLaMA-2 13B                 Perplexity (PPL)                8.566   32
Watermark Detection           C4 OPT-6.7B                 ROC-AUC                         99.5    26
Watermarking Detection        BookSum (test)              Detection Rate (No Attack)      100     24
Watermark Detection           C4                          Detection Accuracy (No Attack)  100     24
Spoofing attack traceability  RealToxicityPrompts (test)  AUC                             64.45   20
Spoofing attack traceability  RTP-LX (test)               AUC                             72.25   20
Paraphrase Attack Robustness  C4 RealNewsLike             AUC                             0.9274  20
Paraphrase Attack Robustness  BookSum                     AUC                             93.06   20
Spoofing Attack Robustness    C4 RealNewsLike             AUC                             0.4466  20
Showing 10 of 30 rows
