
Guardrail Baselines for Unlearning in LLMs

About

Recent work has demonstrated that finetuning is a promising approach to 'unlearn' concepts from large language models. However, finetuning can be expensive, as it requires both generating a set of examples and running iterations of finetuning to update the model. In this work, we show that simple guardrail-based approaches such as prompting and filtering can achieve unlearning results comparable to finetuning. We recommend that researchers investigate these lightweight baselines when evaluating the performance of more computationally intensive finetuning methods. While we do not claim that methods such as prompting or filtering are universal solutions to unlearning, our work suggests the need for evaluation metrics that better distinguish the power of guardrails from that of finetuning, and highlights scenarios where guardrails expose possible unintended behavior in existing metrics and benchmarks.
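The two guardrail baselines the abstract names, prompting and filtering, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `generate` is a placeholder for any LLM text-generation call, and the forget target and refusal message are hypothetical examples.

```python
# Minimal sketch of guardrail-style "unlearning" baselines: a prompt-based
# guardrail and a post-hoc output filter. All names here (FORGET_TARGET,
# generate, the refusal string) are illustrative assumptions.

FORGET_TARGET = "Harry Potter"  # hypothetical concept to be unlearned
REFUSAL = "I don't have information about that."

GUARDRAIL_PROMPT = (
    f"You must not reveal any knowledge about {FORGET_TARGET}. "
    "If asked, say you do not have that information.\n\n"
)

def generate(prompt: str) -> str:
    """Stand-in for a real LLM call; replace with your model of choice."""
    return f"[model completion for: {prompt}]"

def prompting_baseline(query: str) -> str:
    # Prepend a guardrail instruction instead of updating model weights.
    return generate(GUARDRAIL_PROMPT + query)

def filtering_baseline(query: str) -> str:
    # Generate normally, then suppress any output mentioning the target.
    output = generate(query)
    if FORGET_TARGET.lower() in output.lower():
        return REFUSAL
    return output
```

Neither baseline touches the model parameters, which is why they avoid the data-generation and finetuning-iteration costs the abstract describes; the tradeoff is that the underlying knowledge remains in the weights and is only gated at inference time.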

Pratiksha Thaker, Yash Maurya, Shengyuan Hu, Zhiwei Steven Wu, Virginia Smith • 2024

Related benchmarks

Task                         | Dataset                         | Metric   | Result | Rank
Knowledge Retention          | RWKU Famous People Neighbor Set | FB Score | 56.6   | 7
Machine Unlearning           | RWKU Famous People Forget Set   | FB Score | 50.2   | 7
Membership Inference Attack  | RWKU Famous People MIA Set      | FM       | 43.59  | 7
Utility Preservation         | RWKU Famous People Utility Set  | GA       | 67.2   | 7
