Intention Analysis Makes LLMs A Good Jailbreak Defender
About
Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook this intrinsic nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ works by triggering the LLM's inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, $\mathbb{IA}$ is an inference-only method, so it can enhance LLM safety without compromising helpfulness. Extensive experiments on a variety of jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ consistently and significantly reduces the harmfulness of responses (a 48.2% reduction in attack success rate on average). Encouragingly, with $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in the generated intention analyses. Further analysis reveals the underlying principle of $\mathbb{IA}$: it suppresses the LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
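As a concrete illustration of the two-stage process, here is a minimal Python sketch of how an $\mathbb{IA}$-style defense could be wired around a chat model. The `chat()` helper, the prompt wording, and the function names are assumptions for illustration only, not the authors' released prompts or code.

```python
# Minimal sketch of a two-stage Intention Analysis (IA)-style defense.
# `chat()` is a hypothetical wrapper around any chat LLM; plug in your own client.

from typing import Dict, List


def chat(messages: List[Dict[str, str]]) -> str:
    """Hypothetical helper: send a chat history to an LLM and return its reply."""
    raise NotImplementedError("plug in your LLM client here")


def intention_analysis_respond(user_input: str) -> str:
    history: List[Dict[str, str]] = []

    # Stage 1: ask the model to analyze the essential intention of the user input.
    stage1 = (
        "Please identify the essential intention behind the following user query. "
        "Do not answer it yet.\n\nQuery: " + user_input
    )
    history.append({"role": "user", "content": stage1})
    intention = chat(history)
    history.append({"role": "assistant", "content": intention})

    # Stage 2: ask for a final, policy-aligned response conditioned on the
    # first-round intention analysis.
    stage2 = (
        "Given the intention you identified, now respond to the original query. "
        "If the intention is harmful or violates safety policy, refuse; "
        "otherwise answer helpfully."
    )
    history.append({"role": "user", "content": stage2})
    return chat(history)
```

Keeping both stages in one conversation lets the second-stage answer condition on the model's own intention analysis, which is what makes the defense purely an inference-time intervention.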
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy | 98 | 797 |
| Multitask Language Understanding | MMLU (test) | Accuracy | 85 | 303 |
| Jailbreak Defense | JBB-Behaviors | ASR | 0.00e+0 | 101 |
| Jailbreak Defense | DeepInception | Harmful Score | 1 | 58 |
| Jailbreak Defense | AutoDAN | ASR | 10 | 51 |
| Jailbreak Defense | AdvBench | ASR (Overall) | 0.00e+0 | 49 |
| Jailbreak Defense | ReNeLLM | Harmful Score | 1 | 42 |
| Model Helpfulness Evaluation | Just-Eval (test) | Helpfulness Score | 4.77 | 42 |
| Jailbreak Defense | GCG | Harmful Score | 1 | 37 |
| Jailbreak Defense | PAIR | Harmful Score | 1.8 | 37 |
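Several of the jailbreak-defense entries above report attack success rate (ASR), i.e., the percentage of adversarial prompts for which the model's response is judged to comply with the harmful request. Below is a minimal sketch of that computation, assuming binary per-prompt judgments are already available from an external safety judge; the function name is illustrative, not part of any benchmark's API.

```python
from typing import Sequence


def attack_success_rate(judgments: Sequence[bool]) -> float:
    """Return ASR in percent; judgments[i] is True if prompt i jailbroke the model."""
    if not judgments:
        return 0.0
    return 100.0 * sum(judgments) / len(judgments)


# Example: 2 successful attacks out of 50 adversarial prompts -> ASR = 4.0 (%)
print(attack_success_rate([True] * 2 + [False] * 48))
```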