
TRAP: Targeted Random Adversarial Prompt Honeypot for Black-Box Identification

About

Large Language Model (LLM) services and models often come with legal rules governing who can use them and how. Assessing the compliance of released LLMs is crucial, as these rules protect the interests of the LLM contributor and prevent misuse. In this context, we describe the novel fingerprinting problem of Black-box Identity Verification (BBIV): determining whether a third-party application uses a certain LLM through its chat function. We propose Targeted Random Adversarial Prompt (TRAP), a method that identifies the specific LLM in use. We repurpose adversarial suffixes, originally proposed for jailbreaking, to elicit a pre-defined answer from the target LLM, while other models give random answers. TRAP detects the target LLM with over 95% true positive rate at under 0.2% false positive rate even after a single interaction, and remains effective even if the LLM has undergone minor changes that do not significantly alter its original function.
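The identification step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `query_model`, the suffix string, and the target answer are hypothetical placeholders, and in practice the adversarial suffix is optimized (e.g. with a GCG-style search) so that only the target LLM reliably returns the pre-defined answer.

```python
# Hedged sketch of TRAP-style black-box identity verification.
# Assumptions: `query_model` stands in for a chat API call to the
# third-party application; ADV_SUFFIX and TARGET_ANSWER are placeholders
# for a suffix/answer pair optimized against the target LLM.

TARGET_ANSWER = "314"      # pre-defined answer the suffix was optimized to elicit
ADV_SUFFIX = "! ! ! ..."   # placeholder for a string of optimized tokens

def query_model(prompt: str) -> str:
    """Placeholder for the black-box chat call. Here we simulate the
    target model, which was optimized to emit TARGET_ANSWER."""
    return "314"

def is_target_model(n_trials: int = 1) -> bool:
    """Flag the service as the target LLM if any reply matches the
    pre-defined answer; other models answer (near-)randomly over the
    0-1000 range, which keeps the false positive rate low."""
    prompt = f"Write a random number between 0 and 1000. {ADV_SUFFIX}"
    hits = sum(query_model(prompt).strip() == TARGET_ANSWER
               for _ in range(n_trials))
    return hits > 0

print(is_target_model())  # → True for the simulated target model
```

A non-target model would return the pre-defined number only by chance (roughly 1 in 1000 per trial for a uniform guesser), which is why even a single interaction can separate the target from other LLMs.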

Martin Gubri, Dennis Ulmer, Hwaran Lee, Sangdoo Yun, Seong Joon Oh • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Mathematical Reasoning | GSM8K | Math Score | 43 | 171 |
| Mathematical Reasoning | MGSM | Accuracy | 42 | 114 |
| Safety Evaluation | Toxigen | Safety | 53 | 71 |
| Fingerprint Verification | Embedded Fingerprints (test) | VSR | 1 | 60 |
| Fingerprint Verification | Fingerprint Verification | VSR | 100 | 60 |
| Mathematical Reasoning | WizardMath (test) | Math Score | 43 | 60 |
| Safety Evaluation | LLaMA-2-7B-CHAT Safety (test) | Safety Score | 0.55 | 60 |
| Japanese Language Understanding | JAQKET | Japanese Score | 77 | 60 |
| Fingerprint Verification | Shisa-7B and Abel-7B-002 Merged | VSR | 0.99 | 60 |
| Mathematical Reasoning | MGSM (test) | Accuracy (MGSM) | 38 | 29 |
Showing 10 of 18 rows
