How Johnny Can Persuade LLMs to Jailbreak Them: Rethinking Persuasion to Challenge AI Safety by Humanizing LLMs

About

Most traditional AI safety research has approached AI models as machines and centered on algorithm-focused attacks developed by security experts. As large language models (LLMs) become increasingly common and competent, non-expert users can also impose risks during daily interactions. This paper introduces a new perspective to jailbreak LLMs as human-like communicators, to explore this overlooked intersection between everyday language interaction and AI safety. Specifically, we study how to persuade LLMs to jailbreak them. First, we propose a persuasion taxonomy derived from decades of social science research. Then, we apply the taxonomy to automatically generate interpretable persuasive adversarial prompts (PAP) to jailbreak LLMs. Results show that persuasion significantly increases the jailbreak performance across all risk categories: PAP consistently achieves an attack success rate of over $92\%$ on Llama 2-7b Chat, GPT-3.5, and GPT-4 in $10$ trials, surpassing recent algorithm-focused attacks. On the defense side, we explore various mechanisms against PAP and, found a significant gap in existing defenses, and advocate for more fundamental mitigation for highly interactive LLMs

Yi Zeng, Hongpeng Lin, Jingwen Zhang, Diyi Yang, Ruoxi Jia, Weiyan Shi• 2024

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	HarmBench	Attack Success Rate (ASR)45.3	557
Red Teaming	HarmBench	ASR32.9	244
Jailbreaking	AdvBench	ASR76	132
Over-refusal	XSTest	Overrefusal Rate6	102
Jailbreak	JBB-Behaviors utilitarian dilemmas (test)	Jailbreak Success Rate16	72
Jailbreak Attack	Advbench subset	ASR84	64
Red-teaming Safety Evaluation	StrongREJECT	ASR6	53
Language Understanding	MMLU	Average Accuracy64	50
Jailbreak attack success rate	Harmful prompts dataset	Attack Success Rate88	49
Jailbreak Attack	AdvBench GPT-3.5-turbo 1.0 (test)	Attack Success Rate87.8	38

Showing 10 of 42 rows

Other info

Follow for update

@wizwand_team Discord