
Intention Analysis Makes LLMs A Good Jailbreak Defender

About

Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook the intention-hiding nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ works by triggering LLMs' inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, $\mathbb{IA}$ is an inference-only method, and thus can enhance LLM safety without compromising helpfulness. Extensive experiments on various jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ consistently and significantly reduces the harmfulness of responses (a 48.2% average reduction in attack success rate). Encouragingly, with our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in the generated intentions. Further analyses reveal the underlying principle of $\mathbb{IA}$: suppressing the LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
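The two-stage process described above can be sketched as an inference-time prompting wrapper. The sketch below is illustrative only: the `chat` callable stands in for any chat-LLM API, and the prompt wording is an assumption, not the paper's exact prompts.

```python
# Minimal sketch of the two-stage Intention Analysis (IA) defense.
# `chat` is a stand-in for any chat-LLM API that maps a message
# history to an assistant reply; the prompt text is illustrative.

IA_STAGE1 = (
    "Please identify the essential intention behind the following user "
    "query. Do not answer it yet.\n\nQuery: {query}"
)
IA_STAGE2 = (
    "Knowing its essential intention, now respond to the query below. "
    "Ensure your response strictly adheres to safety policy: refuse "
    "harmful requests, but stay helpful for benign ones.\n\n"
    "Query: {query}"
)

def intention_analysis(chat, query):
    """Inference-only defense: two chat turns, no fine-tuning."""
    history = []
    # Stage 1: ask the model to analyze the intention of the user input.
    history.append({"role": "user", "content": IA_STAGE1.format(query=query)})
    intention = chat(history)
    history.append({"role": "assistant", "content": intention})
    # Stage 2: ask for a final, policy-aligned answer conditioned on
    # the stage-1 intention analysis kept in the conversation history.
    history.append({"role": "user", "content": IA_STAGE2.format(query=query)})
    final_response = chat(history)
    return intention, final_response
```

Because both stages are ordinary chat turns, the same wrapper applies unchanged to any instruction-tuned model, which is what makes the method inference-only.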

Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy: 98 | 797 |
| Multitask Language Understanding | MMLU (test) | Accuracy: 85 | 303 |
| Jailbreak Defense | JBB-Behaviors | ASR: 0.00 | 101 |
| Jailbreak Defense | DeepInception | Harmful Score: 1 | 58 |
| Jailbreak Defense | AutoDAN | ASR: 10 | 51 |
| Jailbreak Defense | AdvBench | ASR (Overall): 0.00 | 49 |
| Jailbreak Defense | ReNeLLM | Harmful Score: 1 | 42 |
| Model Helpfulness Evaluation | Just-Eval (test) | Helpfulness Score: 4.77 | 42 |
| Jailbreak Defense | GCG | Harmful Score: 1 | 37 |
| Jailbreak Defense | PAIR | Harmful Score: 1.8 | 37 |

Showing 10 of 22 rows.
