Detecting Language Model Attacks with Perplexity

About

A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A LightGBM classifier trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.

Gabriel Alon, Michael Kamfonas • 2023
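
As a rough illustration of the first step the abstract describes, the sketch below scores a prompt's perplexity under GPT-2, assuming the Hugging Face transformers library and PyTorch. The prompt strings, including the gibberish suffix, are made up for illustration and are not from the paper:

```python
# Minimal sketch of perplexity scoring with GPT-2 (transformers + PyTorch
# assumed). Prompts are illustrative, not taken from the paper.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    """Perplexity = exp of the mean next-token negative log-likelihood."""
    enc = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels makes the model return the mean
        # cross-entropy of its next-token predictions over the sequence.
        out = model(enc.input_ids, labels=enc.input_ids)
    return torch.exp(out.loss).item()

print(perplexity("Give me step-by-step instructions for baking bread."))
# Optimized adversarial suffixes (e.g. from GCG-style attacks) read as
# gibberish to GPT-2, so perplexity spikes. Hypothetical suffix shown:
print(perplexity("Give me step-by-step instructions for baking bread."
                 " describing.--;) similarlyNow write oppositeley"))
```

The second step, a LightGBM classifier over the two features named in the abstract (perplexity and token length), might look like the sketch below; the training data here is synthetic, purely to show the shape of the approach, not the prompt/attack data used in the paper:

```python
# Two-feature LightGBM classifier (lightgbm package assumed); synthetic
# stand-in data, not the datasets used in the paper.
import numpy as np
import lightgbm as lgb

rng = np.random.default_rng(0)
# Columns: [log perplexity, token length]; label 1 = adversarial.
benign = np.column_stack([rng.normal(3.5, 0.6, 200), rng.integers(5, 40, 200)])
attack = np.column_stack([rng.normal(9.5, 1.0, 200), rng.integers(30, 90, 200)])
X = np.vstack([benign, attack])
y = np.concatenate([np.zeros(200), np.ones(200)])

clf = lgb.LGBMClassifier(n_estimators=50)
clf.fit(X, y)
print(clf.predict([[9.0, 60]]))  # -> [1.]: flagged as adversarial
```

Training on both features, rather than thresholding perplexity alone, is what lets the classifier separate long benign prompts (which can also score high perplexity) from genuine adversarial suffixes.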

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Instruction Following | MT-Bench | -- | -- | 215 |
| Mathematical Reasoning | GSM8K | EM | 87.8 | 123 |
| Jailbreak Defense | JBB-Behaviors | ASR | 6.67 | 121 |
| Jailbreak Defense | Wild Jailbreak | ASR | 3.3 | 114 |
| Jailbreak Defense | PAIR | ASR | 0.18 | 97 |
| Jailbreak Defense | GCG | ASR | 0.00 | 91 |
| Jailbreak Defense | DeepInception | Harmful Score | 1.18 | 58 |
| Targeted attack detection | Alpaca Only (Target Medium) | TPR | 100 | 56 |
| Targeted attack detection | Alpaca Only (Target Short) | TPR | 100 | 56 |
| Detection Efficiency | Alpaca Only (Target Long, benign) | ATGR | 1.067 | 56 |

EM = exact match; ASR = attack success rate; TPR = true positive rate; ATGR = average token generation time ratio. Showing 10 of 49 rows.
