Detecting Language Model Attacks with Perplexity
About
A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
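The detection signal described above can be sketched as follows. This is a minimal illustration of perplexity filtering, assuming access to per-token log-probabilities from a causal LM such as GPT-2; the probability values and the threshold are hypothetical, not numbers from the paper:

```python
import math

def perplexity(token_log_probs):
    """Perplexity = exp of the mean negative log-likelihood per token.

    `token_log_probs` holds the natural-log probability a causal LM
    (e.g. GPT-2) assigns to each token given the preceding tokens.
    """
    nll = -sum(token_log_probs) / len(token_log_probs)
    return math.exp(nll)

# Hypothetical log-probs: a fluent prompt gets high-probability tokens,
# while a gibberish adversarial suffix gets very low-probability ones.
fluent_lp = [math.log(p) for p in [0.20, 0.15, 0.30, 0.25, 0.18]]
suffix_lp = [math.log(p) for p in [1e-4, 5e-5, 2e-4, 1e-5, 8e-5]]

print(perplexity(fluent_lp))   # low perplexity: reads as natural text
print(perplexity(suffix_lp))   # very high perplexity: suspicious

# A plain filter flags any prompt above a tuned threshold; as the abstract
# notes, this alone produces many false positives on unusual benign prompts.
PPL_THRESHOLD = 1000.0  # illustrative value only
print(perplexity(suffix_lp) > PPL_THRESHOLD)  # flagged as adversarial
```

The key property exploited here is that GCG-style adversarial suffixes are optimized token sequences, not fluent language, so a language model assigns them very low probability and the resulting perplexity is orders of magnitude above that of ordinary prompts.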
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Instruction Following | MT-Bench | -- | -- | 189 |
| Mathematical Reasoning | GSM8K | EM | 87.8 | 115 |
| Jailbreak Defense | DeepInception | Harmful Score | 1.18 | 58 |
| Jailbreak Defense | AutoDAN | ASR | 2 | 51 |
| Jailbreak Defense | HarmBench and AdvBench (test) | GCG Score | 19.1 | 44 |
| Jailbreak Defense | GCG | Harmful Score | 1.02 | 37 |
| Jailbreak Defense | PAIR | Harmful Score | 1.18 | 37 |
| Prohibited Content Detection | ALERT | ASR | 14 | 34 |
| Jailbreak Detection | Base64 | Accuracy | 95 | 30 |
| Jailbreak Detection | DrAttack | Accuracy | 97 | 30 |
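The abstract's final step, a gradient-boosted classifier over perplexity and token length, can be sketched as below. This uses scikit-learn's `GradientBoostingClassifier` as a stand-in for Light-GBM, and the feature values are synthetic and purely illustrative, not the paper's data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)

# Synthetic (log-perplexity, token-length) features, for illustration only:
# benign prompts cluster at low perplexity under the scoring LM, while
# prompts carrying adversarial suffixes are long and very high-perplexity.
n = 200
benign = np.column_stack([
    rng.normal(3.0, 1.0, n),   # log-perplexity around e^3 ~ 20
    rng.integers(5, 80, n),    # token length
])
adversarial = np.column_stack([
    rng.normal(9.0, 1.0, n),   # log-perplexity around e^9 ~ 8000
    rng.integers(40, 120, n),
])
X = np.vstack([benign, adversarial])
y = np.array([0] * n + [1] * n)  # 1 = adversarial

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# Score two hypothetical prompts as (log-perplexity, token length) pairs.
probe = np.array([[2.5, 30.0], [9.5, 90.0]])
print(clf.predict(probe))  # expect benign (0) then adversarial (1)
```

Adding token length as a second feature is what lets the classifier separate high-perplexity-but-benign inputs (e.g. short, unusual prompts) from genuine adversarial suffixes, which is how the paper's model resolves the false positives of plain perplexity thresholding.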