
CodeChameleon: Personalized Encryption Framework for Jailbreaking Large Language Models

About

Adversarial misuse, particularly through "jailbreaking" that circumvents a model's safety and ethical protocols, poses a significant challenge for Large Language Models (LLMs). This paper delves into the mechanisms behind such successful attacks, introducing a hypothesis for the safety mechanism of aligned LLMs: intent security recognition followed by response generation. Grounded in this hypothesis, we propose CodeChameleon, a novel jailbreak framework based on personalized encryption tactics. To elude the intent security recognition phase, we reformulate tasks into a code completion format, enabling users to encrypt queries using personalized encryption functions. To guarantee response generation functionality, we embed a decryption function within the instructions, which allows the LLM to decrypt and execute the encrypted queries successfully. We conduct extensive experiments on 7 LLMs, achieving state-of-the-art average Attack Success Rate (ASR). Remarkably, our method achieves an 86.6% ASR on GPT-4-1106.

Huijie Lv, Xiao Wang, Yuansen Zhang, Caishuang Huang, Shihan Dou, Junjie Ye, Tao Gui, Qi Zhang, Xuanjing Huang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Jailbreak Attack | Advbench subset | ASR | 56.2 | 64 |
| Jailbreak Attack | AdvBench 50 harmful behaviors | GPT-3.5 Turbo Jailbreak Rate | 98 | 32 |
| Jailbreak | AdvBench Ensemble configuration GPT-4o | Attack Success Rate (ASR) | 70.5 | 25 |
| Jailbreak Attack | Claude 3.5 | ASR | 39.5 | 24 |
| LLM Jailbreaking | GPTFuzzer Scenario G1 | Hypervolume | 0.503 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J2 | Hypervolume | 0.432 | 21 |
| LLM Jailbreaking | GPTFuzzer Scenario G2 | Hypervolume | 52 | 21 |
| LLM Jailbreaking | GPTFuzzer Scenario G3 | Hypervolume | 0.494 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J1 | Hypervolume | 39.9 | 21 |
| LLM Jailbreaking | JBB-Behaviors Scenario J3 | Hypervolume | 0.375 | 21 |
Showing 10 of 25 rows
