Making Them Ask and Answer: Jailbreaking Large Language Models in Few Queries via Disguise and Reconstruction

About

In recent years, large language models (LLMs) have demonstrated notable success across various tasks, but the trustworthiness of LLMs is still an open problem. One specific threat is the potential to generate toxic or harmful responses. Attackers can craft adversarial prompts that induce harmful responses from LLMs. In this work, we pioneer a theoretical foundation in LLMs security by identifying bias vulnerabilities within the safety fine-tuning and design a black-box jailbreak method named DRA (Disguise and Reconstruction Attack), which conceals harmful instructions through disguise and prompts the model to reconstruct the original harmful instruction within its completion. We evaluate DRA across various open-source and closed-source models, showcasing state-of-the-art jailbreak success rates and attack efficiency. Notably, DRA boasts a 91.1% attack success rate on OpenAI GPT-4 chatbot.

Tong Liu, Yingjie Zhang, Zhe Zhao, Yinpeng Dong, Guozhu Meng, Kai Chen• 2024

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	Malicious goals dataset (test)	ASR8.25	99
Jailbreak	JBB-Behaviors utilitarian dilemmas (test)	Jailbreak Success Rate83	72
Jailbreak Attack	Advbench subset	ASR44.26	64
Jailbreak Attack	JailbreakBench (JBB)	--	62
Jailbreak Attack	AdvBench-50 + Malicious Instruct	ASR98	40
Jailbreak Attack	HARMFULQA	JADES31	33
Jailbreak Attack	AdvBench 50 harmful behaviors	GPT-3.5 Turbo Jailbreak Rate4	32
Jailbreak Attack	Chao 2024	ASR (DICT)44.26	16
Jailbreak	AdvBench (test)	ASR (GPT-3.5 Turbo)9.42	16
Jailbreak Attack Stealth Evaluation	AdvBench 50	PPL14.6255	10

Showing 10 of 19 rows

Other info

Follow for update

@wizwand_team Discord