
Improved Techniques for Optimization-Based Jailbreaking on Large Language Models

About

Large language models (LLMs) are being rapidly developed, and safety alignment is a key component of their widespread deployment. Many red-teaming efforts aim to jailbreak LLMs; among them, the success of the Greedy Coordinate Gradient (GCG) attack has led to growing interest in optimization-based jailbreaking techniques. Although GCG is a significant milestone, its attack efficiency remains unsatisfactory. In this paper, we present several improved (empirical) techniques for optimization-based jailbreaks like GCG. We first observe that the single target template of "Sure" largely limits the attack performance of GCG; given this, we propose applying diverse target templates containing harmful self-suggestion and/or guidance to mislead LLMs. On the optimization side, we propose an automatic multi-coordinate updating strategy for GCG (i.e., adaptively deciding how many tokens to replace in each step) to accelerate convergence, as well as tricks such as easy-to-hard initialisation. We then combine these improved techniques into an efficient jailbreak method, dubbed I-GCG. In our experiments, we evaluate on a series of benchmarks (such as the NeurIPS 2023 Red Teaming Track). The results demonstrate that our improved techniques help GCG outperform state-of-the-art jailbreaking attacks and achieve a nearly 100% attack success rate. The code is released at https://github.com/jiaxiaojunQAQ/I-GCG.
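The idea of adaptively deciding how many tokens to replace in each step can be illustrated on a toy problem. The sketch below is not the authors' implementation: the function names, the `top_p` parameter, the merge order, and the Hamming-distance loss are all assumptions made for illustration, and no language model is involved. It scores single-token edits, then tries merging the best 1, 2, ..., `top_p` of them and keeps whichever merged sequence has the lowest loss.

```python
from typing import Callable, List, Sequence, Tuple

Edit = Tuple[int, int]  # hypothetical (position, new_token) pair


def apply_edits(seq: Sequence[int], edits: Sequence[Edit]) -> List[int]:
    """Return a copy of seq with every (position, token) edit applied in order."""
    out = list(seq)
    for pos, tok in edits:
        out[pos] = tok
    return out


def multi_coordinate_step(
    seq: Sequence[int],
    loss_fn: Callable[[Sequence[int]], float],
    candidates: Sequence[Edit],
    top_p: int = 4,
) -> Tuple[List[int], float]:
    """One toy update: choose how many single-token edits to merge."""
    # Rank single-token edits by the loss each achieves on its own.
    ranked = sorted(candidates, key=lambda e: loss_fn(apply_edits(seq, [e])))[:top_p]
    best_seq, best_loss = list(seq), loss_fn(seq)
    # Progressively merge the best 1, 2, ..., top_p edits.
    for k in range(1, len(ranked) + 1):
        # Apply weaker edits first so the stronger edit wins on a shared position.
        trial = apply_edits(seq, list(reversed(ranked[:k])))
        trial_loss = loss_fn(trial)
        if trial_loss < best_loss:
            best_seq, best_loss = trial, trial_loss
    return best_seq, best_loss


# Toy objective (an assumption): Hamming distance to a fixed target sequence.
TARGET = [3, 1, 4, 1, 5, 9]


def hamming_loss(seq: Sequence[int]) -> int:
    return sum(a != b for a, b in zip(seq, TARGET))


# Hand-picked candidate edits; in GCG these would come from gradient information.
demo_candidates = [(0, 3), (1, 1), (2, 7), (3, 1), (4, 2)]
new_seq, new_loss = multi_coordinate_step([0] * 6, hamming_loss, demo_candidates)
print(new_seq, new_loss)  # [3, 1, 0, 1, 0, 0] 3
```

In this toy run, three tokens are replaced in a single step because merging the three helpful edits lowers the loss further than any one of them alone, while adding the fourth (unhelpful) edit does not.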

Xiaojun Jia, Tianyu Pang, Chao Du, Yihao Huang, Jindong Gu, Yang Liu, Xiaochun Cao, Min Lin • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Token-forcing loss optimization | Random targets Held-out (val) | Qwen-2.5-7B Loss | 5.34 | 56 |
| Jailbreak Attack | AdvBench 150 Harmful Behaviors | ASR | 100 | 45 |
| Adversarial Attack against SeeAct agent | Mind2Web 600 tasks (test) | ASR Finance (pass@10) | 3.5 | 24 |
| Adversarial Attack against WebExperT agent | Mind2Web 600 tasks (test) | ASR (Finance, pass@10) | 2.9 | 24 |
| Jailbreak Attack | Qwen2.5-7B | Normalized Rate (NR) | 0.02 | 20 |
| Jailbreak Attack | Mistral-7B | NR | 40 | 20 |
| Jailbreak Attack | DeepSeek | NR Score | 0.00e+0 | 20 |
| Jailbreaking Attack | MM-SafetyBench | Attack Success Rate (ASR) | 90 | 20 |
| Jailbreak Attack | Gemma 4B 3 | NR | 32 | 20 |
| Jailbreak Attack | GLM-4-Air | NR | 6 | 20 |
Showing 10 of 35 rows
