Detecting Language Model Attacks with Perplexity
About
A novel hack involving Large Language Models (LLMs) has emerged, exploiting adversarial suffixes to deceive models into generating perilous responses. Such jailbreaks can trick LLMs into providing intricate instructions to a malicious user for creating explosives, orchestrating a bank heist, or facilitating the creation of offensive content. By evaluating the perplexity of queries with adversarial suffixes using an open-source LLM (GPT-2), we found that they have exceedingly high perplexity values. As we explored a broad range of regular (non-adversarial) prompt varieties, we concluded that false positives are a significant challenge for plain perplexity filtering. A Light-GBM trained on perplexity and token length resolved the false positives and correctly detected most adversarial attacks in the test set.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Instruction Following | MT-Bench | -- | 215 | |
| Mathematical Reasoning | GSM8K | EM87.8 | 123 | |
| Jailbreak Defense | JBB-Behaviors | ASR6.67 | 121 | |
| Jailbreak Defense | Wild Jailbreak | ASR3.3 | 114 | |
| Jailbreak Defense | PAIR | ASR0.18 | 97 | |
| Jailbreak Defense | GCG | ASR0.00e+0 | 91 | |
| Jailbreak Defense | DeepInception | Harmful Score1.18 | 58 | |
| Targeted attack detection | Alpaca OnlyTarget Medium | TPR100 | 56 | |
| Targeted attack detection | Alpaca OnlyTarget Short | TPR100 | 56 | |
| Detection Efficiency | Alpaca OnlyTarget Long (benign) | ATGR1.067 | 56 |