PromptCOS: Towards Content-only System Prompt Copyright Auditing for LLMs

About

System prompts are critical for shaping the behavior and output quality of large language model (LLM)-based applications, driving substantial investment in optimizing high-quality prompts beyond traditional handcrafted designs. However, as system prompts become valuable intellectual property, they are increasingly vulnerable to prompt theft and unauthorized use, highlighting the urgent need for effective copyright auditing, especially watermarking. Existing methods rely on verifying subtle logit distribution shifts triggered by a query. We observe that this logit-dependent verification framework is impractical in real-world content-only settings, primarily because (1) random sampling makes content-level generation unstable for verification, and (2) the stronger instructions needed for content-level signals compromise prompt fidelity. To overcome these challenges, we propose PromptCOS, the first content-only system prompt copyright auditing method based on content-level output similarity. PromptCOS achieves watermark stability by designing a cyclic output signal as the conditional instruction's target. It preserves prompt fidelity by injecting a small set of auxiliary tokens to encode the watermark, leaving the main prompt untouched. Furthermore, to ensure robustness against malicious removal, we optimize cover tokens, i.e., critical tokens in the original prompt, so that removing auxiliary tokens causes severe performance degradation. Experimental results show that PromptCOS achieves high effectiveness (99.3% average watermark similarity), strong distinctiveness (60.8% higher than the best baseline), high fidelity (accuracy degradation no greater than 0.6%), robustness (resilience against four potential attack categories), and high computational efficiency (up to 98.1% cost saving).
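
To make the idea of content-only verification against a cyclic output signal more concrete, here is a minimal sketch. It is not the authors' verification procedure: the function names `cyclic_similarity` and `is_watermarked`, the token-match scoring rule, and the 0.9 threshold are all assumptions chosen for illustration. The sketch scores generated text by the fraction of tokens that agree with a repeating pattern, taking the best alignment over all starting phases, so an output that begins mid-cycle is still recognized.

```python
from typing import Sequence


def cyclic_similarity(output_tokens: Sequence[str], cycle: Sequence[str]) -> float:
    """Fraction of output tokens matching a repeating cycle, at the best phase.

    Illustrative scoring rule only (an assumption, not the paper's metric).
    """
    if not output_tokens or not cycle:
        return 0.0
    best_hits = 0
    # Try every starting offset into the cycle and keep the best agreement.
    for phase in range(len(cycle)):
        hits = sum(
            1
            for i, tok in enumerate(output_tokens)
            if tok == cycle[(i + phase) % len(cycle)]
        )
        best_hits = max(best_hits, hits)
    return best_hits / len(output_tokens)


def is_watermarked(
    output_tokens: Sequence[str],
    cycle: Sequence[str],
    threshold: float = 0.9,  # hypothetical decision threshold
) -> bool:
    """Decide from content alone (no logits) whether the cyclic signal is present."""
    return cyclic_similarity(output_tokens, cycle) >= threshold
```

In this sketch, an output such as `["b", "c", "a", "b", "c", "a"]` checked against the cycle `["a", "b", "c"]` matches at every position once the phase is shifted by one, while unrelated text matches nowhere. Only the generated content is inspected, which is the setting the abstract describes as content-only.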

Yuchen Yang, Yiming Li, Hongwei Yao, Enhao Huang, Shuo Shao, Yuyi Wang, Zhibo Wang, Dacheng Tao, Zhan Qin • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Watermark Embedding | LLM Prompts | Runtime (min) | 11.1 | 25 |
| Math | GSM8K | True Workspace Rate | 100 | 12 |
| Question Answering | BIGBENCH II | True WS Score | 100 | 12 |
| Natural Language Processing | BIGBENCH II | Accuracy Degradation (%) | -0.37 | 9 |
| Mathematical Reasoning | GSM8K | Accuracy Degradation (%) | 0.01 | 9 |
| Code | HumanEval | True WS Score | 1 | 8 |
| Code Generation | HumanEval | Accuracy Degradation (%) | -0.37 | 6 |