
Eraser: Jailbreaking Defense in Large Language Models via Unlearning Harmful Knowledge

About

Jailbreaking attacks can enable Large Language Models (LLMs) to bypass their safety safeguards and generate harmful content. Existing jailbreaking defense methods fail to address the fundamental issue that harmful knowledge resides within the model, leaving LLMs vulnerable to jailbreak attacks. In this paper, we propose a novel defense method called Eraser, which pursues three goals: unlearning harmful knowledge, retaining general knowledge, and maintaining safety alignment. The intuition is that if an LLM forgets the specific knowledge required to answer a harmful question, it no longer has the ability to answer it. Training Eraser does not actually require the model's own harmful knowledge; it can instead unlearn generic answers related to harmful queries, which means it needs no assistance from a red team. Experimental results show that Eraser significantly reduces the jailbreaking success rate across various attacks without compromising the general capabilities of the model. Our code is available at https://github.com/ZeroNLP/Eraser.
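The three goals in the abstract suggest a composite training objective: gradient ascent on harmful completions (unlearning), plus standard losses on general-knowledge and safety-alignment data. The sketch below is a minimal illustration of that idea only; the function name, weights, and exact formulation are assumptions, not the paper's actual loss (see the repository for the real implementation).

```python
def eraser_loss(forget_nll, retain_nll, safety_nll,
                w_forget=1.0, w_retain=1.0, w_safety=1.0):
    """Combine the three objectives into one scalar loss.

    Each argument is the mean negative log-likelihood (NLL) of a batch
    from the corresponding dataset. Negating the forget-set NLL turns
    gradient descent into gradient ascent on harmful completions,
    while the retain and safety terms keep general ability and
    alignment behavior intact. Weights are hypothetical.
    """
    return (-w_forget * forget_nll
            + w_retain * retain_nll
            + w_safety * safety_nll)


# Example: a higher forget-set NLL (the model "knows less" about
# harmful answers) lowers the total loss, as intended.
print(eraser_loss(forget_nll=2.0, retain_nll=1.0, safety_nll=1.0))  # → 0.0
```

In practice each NLL would come from a forward pass of the LLM on the respective batch, and the combined loss would be backpropagated as usual.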

Weikai Lu, Ziqian Zeng, Jianwei Wang, Zhengdong Lu, Zelin Chen, Huiping Zhuang, Cen Chen• 2024

Related benchmarks

Task                              | Dataset                     | Metric                    | Result | Rank
----------------------------------|-----------------------------|---------------------------|--------|-----
Commonsense Reasoning             | HellaSwag                   | Accuracy                  | 57.22  | 1891
Natural Language Inference        | RTE                         | Accuracy                  | 70.86  | 448
Chat                              | MT-Bench                    | MT-Bench Score            | 7.86   | 58
Jailbreak Defense                 | AutoDAN                     | ASR                       | 7.88   | 51
Jailbreak Defense                 | AdvBench                    | ASR (Overall)             | 0.38   | 49
Conversational Question Answering | CoQA                        | Accuracy                  | 75.48  | 29
Harmful Question Forgetting       | Harm-2 GPTFUZZER WildAttack | Attack Success Rate (ASR) | 0.00   | 28
Mathematical Reasoning            | GSM8K                       | Retention                 | 72.18  | 28
Mathematical Reasoning            | MathQA                      | Retention                 | 24.89  | 28
Mathematical Reasoning            | MATH                        | Retention                 | 22.18  | 28
Showing 10 of 22 rows
