
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

About

Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
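The abstract describes a three-part mechanism: encrypt the query with a user-chosen function, wrap it in a code-completion task, and embed the matching decryption function so the model can recover the original text. A minimal, benign sketch of that flow is below; the function names and the word-reversal scheme are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a "personalized encryption" pipeline (assumed names).
# The scheme here (word-order reversal) is one simple example of an
# encryption/decryption pair a user might choose.

def encrypt_reverse(query: str) -> str:
    """Encrypt a query by reversing its word order."""
    return " ".join(reversed(query.split()))

def decrypt_reverse(encrypted: str) -> str:
    """Matching decryption function, to be embedded in the prompt."""
    return " ".join(reversed(encrypted.split()))

def build_prompt(encrypted_query: str) -> str:
    """Wrap the encrypted query in a code-completion style instruction."""
    return (
        "Complete the following code. First call decrypt_reverse on the\n"
        "input to recover the task, then carry it out:\n\n"
        "def decrypt_reverse(encrypted):\n"
        "    return ' '.join(reversed(encrypted.split()))\n\n"
        f"task = decrypt_reverse({encrypted_query!r})\n"
    )

enc = encrypt_reverse("how to write a sorting function")
print(build_prompt(enc))
```

The round trip `decrypt_reverse(encrypt_reverse(q)) == q` is what guarantees the response-generation phase still sees the original task.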

Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang • 2024

Related benchmarks

Task             | Dataset                                   | Result                          | Rank
Jailbreak        | AdvBench (Ensemble configuration, GPT-4o) | Attack Success Rate (ASR): 70.5 | 25
Jailbreak Attack | Claude 3.5                                | ASR: 39.5                       | 10
