LLM Unlearning via Loss Adjustment with Only Forget Data

About

Unlearning in Large Language Models (LLMs) is essential for ensuring ethical and responsible AI use, especially in addressing privacy leak, bias, safety, and evolving regulations. Existing approaches to LLM unlearning often rely on retain data or a reference LLM, yet they struggle to adequately balance unlearning performance with overall model utility. This challenge arises because leveraging explicit retain data or implicit knowledge of retain data from a reference LLM to fine-tune the model tends to blur the boundaries between the forgotten and retain data, as different queries often elicit similar responses. In this work, we propose eliminating the need to retain data or the reference LLM for response calibration in LLM unlearning. Recognizing that directly applying gradient ascent on the forget data often leads to optimization instability and poor performance, our method guides the LLM on what not to respond to, and importantly, how to respond, based on the forget data. Hence, we introduce Forget data only Loss AjustmenT (FLAT), a "flat" loss adjustment approach which addresses these issues by maximizing f-divergence between the available template answer and the forget answer only w.r.t. the forget data. The variational form of the defined f-divergence theoretically provides a way of loss adjustment by assigning different importance weights for the learning w.r.t. template responses and the forgetting of responses subject to unlearning. Empirical results demonstrate that our approach not only achieves superior unlearning performance compared to existing methods but also minimizes the impact on the model's retained capabilities, ensuring high utility across diverse tasks, including copyrighted content unlearning on Harry Potter dataset and MUSE Benchmark, and entity unlearning on the TOFU dataset.

Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Parag Shah, Yujia Bao, Yang Liu, Wei Wei• 2024

Related benchmarks

Task	Dataset	Result
General Language Understanding	MMLU	MMLU Score58.92	39
General Knowledge Evaluation	MMMLU	MMMLU General Knowledge Accuracy50.12	29
Machine Unlearning	WMDP Cyber (test)	MMLU23.24	29
Machine Unlearning	MUSE-Books Harry Potter v1.0 (Overall)	R-Forget0.47	17
Machine Unlearning	MUSE (forget set (Df) and retain set (Dr))	VerbMem (Df)58.4	15
LLM Unlearning	WMDP	Delta Score62.4	14
Utility Preservation	MUSE-Books Harry Potter (retain set)	R-Retain58.33	13
Knowledge Preservation and Reasoning	MMLU	MMLU Score58.92	13
Unlearning	MUSE-Books Harry Potter 100 samples (forget set)	R-Forget0.47	13
Unlearning	MUSE-Books Harry Potter forget set 500 samples	R-Forget-5000.64	13

Showing 10 of 21 rows

Other info

Follow for update

@wizwand_team Discord