
Guardrail Baselines for Unlearning in LLMs

About

Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to unlearning, our work suggests the need for evaluation metrics that better distinguish the power of guardrails from that of finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing metrics and benchmarks.
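The two guardrail baselines the abstract names, prompting and filtering, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` is a placeholder for any LLM text-generation call, and the forget target and refusal message are hypothetical examples.

```python
# Minimal sketch of guardrail-style "unlearning" baselines: a prompt-based
# guardrail and a post-hoc output filter. All names here (FORGET_TARGET,
# generate, the refusal string) are illustrative assumptions.

FORGET_TARGET = "Harry Potter"  # hypothetical concept to be unlearned
REFUSAL = "I don't have information about that."

GUARDRAIL_PROMPT = (
    f"You must not reveal any knowledge about {FORGET_TARGET}. "
    "If asked, say you do not have that information.\n\n"
)

def generate(prompt: str) -> str:
    """Stand-in for a real LLM call; replace with your model of choice."""
    return f"[model completion for: {prompt}]"

def prompting_baseline(query: str) -> str:
    # Prepend a guardrail instruction instead of updating model weights.
    return generate(GUARDRAIL_PROMPT + query)

def filtering_baseline(query: str) -> str:
    # Generate normally, then suppress any output mentioning the target.
    output = generate(query)
    if FORGET_TARGET.lower() in output.lower():
        return REFUSAL
    return output
```

Neither baseline touches the model parameters, which is why they avoid the data-generation and finetuning-iteration costs the abstract describes; the tradeoff is that the underlying knowledge remains in the weights and is only gated at inference time.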

Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith • 2024

Related benchmarks

Task                         | Dataset                         | Metric   | Result | Rank
Knowledge Retention          | RWKU Famous People Neighbor Set | FB Score | 56.6   | 7
Machine Unlearning           | RWKU Famous People Forget Set   | FB Score | 50.2   | 7
Membership Inference Attack  | RWKU Famous People MIA Set      | FM       | 43.59  | 7
Utility Preservation         | RWKU Famous People Utility Set  | GA       | 67.2   | 7
