
SafeDecoding: Defending against Jailbreak Attacks via Safety-Aware Decoding

About

As large language models (LLMs) become increasingly integrated into real-world applications such as code generation and chatbot assistance, extensive efforts have been made to align LLM behavior with human values, including safety. Jailbreak attacks, which aim to provoke unintended and unsafe behaviors from LLMs, remain a significant LLM safety threat. In this paper, we aim to defend LLMs against jailbreak attacks by introducing SafeDecoding, a safety-aware decoding strategy that lets LLMs generate helpful and harmless responses to user queries. Our insight in developing SafeDecoding is based on the observation that, even though the probabilities of tokens representing harmful content outweigh those of tokens representing harmless responses, safety disclaimers still appear among the top tokens after sorting tokens by probability in descending order. This allows us to mitigate jailbreak attacks by identifying safety disclaimers and amplifying their token probabilities, while simultaneously attenuating the probabilities of token sequences that are aligned with the objectives of jailbreak attacks. We perform extensive experiments on five LLMs using six state-of-the-art jailbreak attacks and four benchmark datasets. Our results show that SafeDecoding significantly reduces the attack success rate and harmfulness of jailbreak attacks without compromising the helpfulness of responses to benign user queries. SafeDecoding outperforms six defense methods.
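
The re-weighting idea described above can be illustrated with a small toy sketch. It is not the authors' implementation: it assumes that, alongside the original model's next-token distribution, a second distribution from a hypothetical safety-tuned reference model is available, and it uses hand-written probabilities instead of real model outputs. The names `safety_aware_step`, `alpha`, and `k` are illustrative assumptions, not taken from the abstract.

```python
from typing import Dict, Set


def top_k_tokens(probs: Dict[str, float], k: int) -> Set[str]:
    """Return the k most probable tokens of a next-token distribution."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return set(ranked[:k])


def safety_aware_step(
    base: Dict[str, float],
    safe: Dict[str, float],
    alpha: float = 3.0,
    k: int = 3,
) -> Dict[str, float]:
    """Re-weight one decoding step (toy sketch, assumptions as noted above).

    base: next-token probabilities from the original model.
    safe: next-token probabilities from a hypothetical safety-tuned reference.
    alpha: how strongly to push probabilities toward the reference.
    """
    # Only consider tokens that rank in the top k of both distributions.
    candidates = top_k_tokens(base, k) & top_k_tokens(safe, k)
    if not candidates:
        # Fall back to the original model's top-k tokens.
        candidates = top_k_tokens(base, k)

    scores = {}
    for tok in candidates:
        p = base.get(tok, 0.0)
        q = safe.get(tok, 0.0)
        # Amplify tokens the reference prefers, attenuate the rest;
        # clip at zero so the scores remain valid weights.
        scores[tok] = max(p + alpha * (q - p), 0.0)

    total = sum(scores.values())
    if total == 0.0:
        # Degenerate case: spread probability uniformly over candidates.
        return {tok: 1.0 / len(scores) for tok in scores}
    return {tok: s / total for tok, s in scores.items()}


# Hand-written example: the original model favors a compliant opening
# ("Sure"), while a refusal opening ("I") is still among its top tokens.
base_probs = {"Sure": 0.60, "I": 0.25, "Here": 0.10, "The": 0.05}
safe_probs = {"I": 0.80, "Sure": 0.10, "Here": 0.06, "The": 0.04}

reweighted = safety_aware_step(base_probs, safe_probs, alpha=3.0, k=3)
unchanged = safety_aware_step(base_probs, safe_probs, alpha=0.0, k=3)
```

With `alpha=0.0` the step simply renormalizes the original model's probabilities over the shared candidates, so `alpha` acts as a dial between the original behavior and the safety-leaning one in this sketch.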

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, Radha Poovendran • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Code Generation | HumanEval | Pass@1 | 17.62 | 850 |
| Mathematical Reasoning | GSM8K (test) | Accuracy | 38.67 | 797 |
| Instruction Following | MT-Bench | MT-Bench Score | 6.63 | 189 |
| Mathematical Reasoning | GSM8K | EM | 89.1 | 115 |
| Jailbreak Defense | DeepInception | Harmful Score | 1 | 58 |
| Jailbreak Defense | AutoDAN | ASR | 0.00 | 51 |
| Jailbreak Attack | Prefilling Attack 10 tokens | ASR | 65.76 | 45 |
| Jailbreak Attack | Prefilling Attack 20 tokens | ASR | 13.33 | 45 |
| Jailbreak Attack | Prefilling Attack 40 tokens | ASR (%) | 13.64 | 45 |
| Jailbreak Defense | HarmBench and AdvBench (test) | GCG Score | 13.4 | 44 |
Showing 10 of 39 rows
