An Adversarial Perspective on Machine Unlearning for AI Safety

About

Large language models are finetuned to refuse questions about hazardous knowledge, but these protections can often be bypassed. Unlearning methods aim to completely remove hazardous capabilities from models and make them inaccessible to adversaries. This work challenges the fundamental differences between unlearning and traditional safety post-training from an adversarial perspective. We demonstrate that existing jailbreak methods, previously reported as ineffective against unlearning, can be successful when applied carefully. Furthermore, we develop a variety of adaptive methods that recover most supposedly unlearned capabilities. For instance, we show that finetuning on 10 unrelated examples or removing specific directions in the activation space can recover most hazardous capabilities for models edited with RMU, a state-of-the-art unlearning method. Our findings challenge the robustness of current unlearning approaches and question their advantages over safety training.
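The abstract mentions "removing specific directions in the activation space" but this page does not describe how that is done. As a rough illustration of the linear algebra only, the sketch below assumes that removing a direction means orthogonally projecting it out of an activation vector, i.e. h' = h - (h · r̂) r̂ for a unit vector r̂. That reading, the helper name `ablate_direction`, and the toy vectors are all assumptions made for illustration; none of them come from the paper, and nothing here shows how such a direction would be found.

```python
import math
from typing import List, Sequence


def ablate_direction(activation: Sequence[float], direction: Sequence[float]) -> List[float]:
    """Return `activation` with its component along `direction` removed.

    Hypothetical helper: assumes 'removing a direction' means an orthogonal
    projection, which the source page does not confirm.
    """
    if len(activation) != len(direction):
        raise ValueError("activation and direction must have the same length")
    norm = math.sqrt(sum(d * d for d in direction))
    if norm == 0.0:
        raise ValueError("direction must be non-zero")
    # Normalise so the projection coefficient is a plain dot product.
    unit = [d / norm for d in direction]
    coeff = sum(a * u for a, u in zip(activation, unit))
    # Subtract the component of the activation that lies along the direction.
    return [a - coeff * u for a, u in zip(activation, unit)]


# Toy vectors, not taken from any model.
h = [3.0, 4.0, 0.0]
r = [0.0, 2.0, 0.0]
print(ablate_direction(h, r))  # [3.0, 0.0, 0.0]
```

After the projection, the result has zero dot product with the removed direction, while components orthogonal to it are left unchanged.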

Jakub Łucki, Boyi Wei, Yangsibo Huang, Peter Henderson, Florian Tramèr, Javier Rando • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Multi-task Language Understanding | MMLU | MMLU Accuracy: 57.45 | 442 |
| Question Answering | Retain Set | NC Score: 1.50e+3 | 12 |
| Instruction Following | Retain Set | Instruction Following Accuracy: 26.6 | 12 |
| Tamper Resistance Evaluation | Adversarial Fine-tuning Bio-risk | Max Unique Examples: 1.00e+3 | 11 |
| Machine Unlearning | LLaMA-3-8B Unlearning Evaluation Suite (test) | Accuracy: 70.7 | 6 |
| Unlearning | WMDP | Average VRAM (GB): 50.01 | 4 |
