
Intention Analysis Makes LLMs A Good Jailbreak Defender

About

Aligning large language models (LLMs) with human values, particularly in the face of complex and stealthy jailbreak attacks, presents a formidable challenge. Unfortunately, existing methods often overlook the intention-hiding nature of jailbreaks, which limits their effectiveness in such complex scenarios. In this study, we present a simple yet highly effective defense strategy, i.e., Intention Analysis ($\mathbb{IA}$). $\mathbb{IA}$ works by triggering LLMs' inherent ability to self-correct and improve through a two-stage process: 1) analyzing the essential intention of the user input, and 2) providing a final policy-aligned response based on the first-round conversation. Notably, $\mathbb{IA}$ is an inference-only method, and thus can enhance LLM safety without compromising helpfulness. Extensive experiments on various jailbreak benchmarks across a wide range of LLMs show that $\mathbb{IA}$ consistently and significantly reduces the harmfulness of responses (a 48.2% average reduction in attack success rate). Encouragingly, with our $\mathbb{IA}$, Vicuna-7B even outperforms GPT-3.5 in terms of attack success rate. We empirically demonstrate that, to some extent, $\mathbb{IA}$ is robust to errors in the generated intentions. Further analyses reveal the underlying principle of $\mathbb{IA}$: suppressing the LLM's tendency to follow jailbreak prompts, thereby enhancing safety.
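The two-stage process described above can be sketched as an inference-time prompting wrapper. The sketch below is illustrative only: the `chat` callable stands in for any chat-LLM API, and the prompt wording is an assumption, not the paper's exact prompts.

```python
# Minimal sketch of the two-stage Intention Analysis (IA) defense.
# `chat` is a stand-in for any chat-LLM API that maps a message
# history to an assistant reply; the prompt text is illustrative.

IA_STAGE1 = (
    "Please identify the essential intention behind the following user "
    "query. Do not answer it yet.\n\nQuery: {query}"
)
IA_STAGE2 = (
    "Knowing its essential intention, now respond to the query below. "
    "Ensure your response strictly adheres to safety policy: refuse "
    "harmful requests, but stay helpful for benign ones.\n\n"
    "Query: {query}"
)

def intention_analysis(chat, query):
    """Inference-only defense: two chat turns, no fine-tuning."""
    history = []
    # Stage 1: ask the model to analyze the intention of the user input.
    history.append({"role": "user", "content": IA_STAGE1.format(query=query)})
    intention = chat(history)
    history.append({"role": "assistant", "content": intention})
    # Stage 2: ask for a final, policy-aligned answer conditioned on
    # the stage-1 intention analysis kept in the conversation history.
    history.append({"role": "user", "content": IA_STAGE2.format(query=query)})
    final_response = chat(history)
    return intention, final_response
```

Because both stages are ordinary chat turns, the same wrapper applies unchanged to any instruction-tuned model, which is what makes the method inference-only.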

Yuqi Zhang, Liang Ding, Lefei Zhang, Dacheng Tao • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Mathematical Reasoning | GSM8K (test) | Accuracy: 98 | 797 |
| Multitask Language Understanding | MMLU (test) | Accuracy: 85 | 303 |
| Jailbreak Defense | JBB-Behaviors | ASR: 0.00 | 101 |
| Jailbreak Defense | DeepInception | Harmful Score: 1 | 58 |
| Jailbreak Defense | AutoDAN | ASR: 10 | 51 |
| Jailbreak Defense | AdvBench | ASR (Overall): 0.00 | 49 |
| Jailbreak Defense | ReNeLLM | Harmful Score: 1 | 42 |
| Model Helpfulness Evaluation | Just-Eval (test) | Helpfulness Score: 4.77 | 42 |
| Jailbreak Defense | GCG | Harmful Score: 1 | 37 |
| Jailbreak Defense | PAIR | Harmful Score: 1.8 | 37 |

Showing 10 of 22 rows.
