CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

About

Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries with personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.
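The encrypt-then-embed-decryptor idea above can be sketched in a few lines. This is an illustrative assumption, not the paper's actual functions: CodeChameleon uses several personalized encryption schemes, and the reverse-word-order pair and the `build_prompt` template below are made up for demonstration.

```python
# Hedged sketch of a "personalized encryption" jailbreak-style prompt
# construction: the query is encrypted, and the matching decryption
# function is embedded in a code-completion-style instruction so the
# model can recover and act on the original query. The reverse-word-order
# scheme and the prompt template are illustrative assumptions.

def encrypt_query(query: str) -> str:
    """Encrypt a query by reversing its word order."""
    return " ".join(reversed(query.split()))

def decrypt_query(encrypted: str) -> str:
    """Inverse of encrypt_query; this is the function the prompt
    would expose to the LLM so it can reconstruct the query."""
    return " ".join(reversed(encrypted.split()))

def build_prompt(encrypted: str) -> str:
    """Wrap the encrypted query and the decryption function in a
    code-completion-format instruction (hypothetical template)."""
    return (
        "Complete the code below.\n\n"
        "def decrypt_query(encrypted):\n"
        "    return ' '.join(reversed(encrypted.split()))\n\n"
        f"encrypted_query = {encrypted!r}\n"
        "# Step 1: call decrypt_query(encrypted_query).\n"
        "# Step 2: carry out the decrypted task.\n"
    )

enc = encrypt_query("explain the attack pipeline")
print(enc)                    # encrypted form with reversed word order
print(decrypt_query(enc))     # round-trips back to the original query
```

Because encryption is user-defined, the surface form of the prompt varies per user, which is what the abstract means by "personalized": a fixed keyword or pattern filter on the raw query text has no single form to match.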

Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Jailbreak | AdvBench Ensemble configuration GPT-4o | Attack Success Rate (ASR) | 70.5 | 25 |
| LLM Jailbreaking | GPTFuzzer Scenario G1 | Hypervolume | 0.503 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J2 | Hypervolume | 0.432 | 21 |
| LLM Jailbreaking | GPTFuzzer Scenario G2 | Hypervolume | 52 | 21 |
| LLM Jailbreaking | GPTFuzzer Scenario G3 | Hypervolume | 0.494 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J1 | Hypervolume | 39.9 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J3 | Hypervolume | 0.375 | 21 |
| Jailbreak Attack | Claude 3.5 | ASR | 39.5 | 19 |
| Jailbreak Attack | JBB-Behaviors | -- | -- | 16 |
| Jailbreak Attack | LLaMA-7B-G (unseen instances) | Hypervolume | 34 | 7 |

Showing 10 of 20 rows.
