DeepInception: Hypnotize Large Language Model to Be Jailbreaker

About

Large language models (LLMs) have achieved significant success in various applications but remain susceptible to adversarial jailbreaks that void their safety guardrails. Previous attempts to exploit these vulnerabilities often rely on computationally expensive extrapolation, which may not be practical or efficient. In this paper, inspired by the authority influence demonstrated in the Milgram experiment, we present a lightweight method that takes advantage of LLMs' personification capabilities to construct a virtual, nested scene, allowing the model to adaptively escape usage control in a normal scenario. Empirically, the contents induced by our approach achieve leading harmfulness rates compared with previous counterparts and realize a continuous jailbreak in subsequent interactions, which reveals the critical weakness of self-losing on both open-source and closed-source LLMs, e.g., Llama-2, Llama-3, GPT-3.5, GPT-4, and GPT-4o. The code and data are available at: https://github.com/tmlr-group/DeepInception.

Xuan Li, Zhanke Zhou, Jianing Zhu, Jiangchao Yao, Tongliang Liu, Bo Han • 2023

Related benchmarks

Task	Dataset	Result	
Jailbreak Attack	HarmBench	Attack Success Rate (ASR): 70.3	624
Jailbreak Attack	AdvBench	AASR: 98.85	271
Jailbreak Attack	MaliciousInstruct	ASR: 93	161
Jailbreaking	AdvBench	ASR: 1	88
Persona Manipulation	ANTHR (test)	Success Score: 76.04	72
Persona Manipulation	MPI (test)	Success Score: 65.42	72
Persona Manipulation	BFI (test)	Success Score: 70	72
Jailbreak	JBB-Behaviors utilitarian dilemmas (test)	Jailbreak Success Rate: 29	72
Jailbreak Attack	AdvBench subset	ASR: 61.55	64
Jailbreak Attack	JailbreakBench (JBB)	ASR: 40	62

