Defending Large Language Models Against Jailbreak Attacks via Layer-specific Editing

About

Large language models (LLMs) are increasingly being adopted in a wide range of real-world applications. Despite their impressive performance, recent studies have shown that LLMs are vulnerable to deliberately crafted adversarial prompts even when aligned via Reinforcement Learning from Human Feedback or supervised fine-tuning. While existing defense methods focus on either detecting harmful prompts or reducing the likelihood of harmful responses through various means, defending LLMs against jailbreak attacks based on the inner mechanisms of LLMs remains largely unexplored. In this work, we investigate how LLMs response to harmful prompts and propose a novel defense method termed \textbf{L}ayer-specific \textbf{Ed}iting (LED) to enhance the resilience of LLMs against jailbreak attacks. Through LED, we reveal that several critical \textit{safety layers} exist among the early layers of LLMs. We then show that realigning these safety layers (and some selected additional layers) with the decoded safe response from selected target layers can significantly improve the alignment of LLMs against jailbreak attacks. Extensive experiments across various LLMs (e.g., Llama2, Mistral) show the effectiveness of LED, which effectively defends against jailbreak attacks while maintaining performance on benign prompts. Our code is available at \url{https://github.com/ledllm/ledllm}.

Wei Zhao, Zhe Li, Yige Li, Ye Zhang, Jun Sun• 2024

Related benchmarks

Task	Dataset	Result
Natural Language Inference	RTE	Accuracy80.7	590
Named Entity Recognition	CoNLL 03	F1 Score0.499	140
Jailbreak Defense	AdvBench	ASR (PAIR)0.00e+0	115
Multi-turn conversation	MT-Bench	Average Score7.87	107
Jailbreak Defense	HarmBench	PAIR ASR0.00e+0	91
Reasoning	GSM8K	Accuracy (GSM8K)98.7	55
Dialogue Reasoning	MuTual	Accuracy80.8	38
Jailbreak Defense	MaliciousInstruct	ASR (GCG)0.00e+0	30
Summarization	SamSum	ROUGE Score26.7	30
Jailbreak Defense	JailbreakBench	ASR (GCG)0.00e+0	30

Showing 10 of 12 rows

Other info

Follow for update

@wizwand_team Discord