A Closer Look at Machine Unlearning for Large Language Models

About

Large language models (LLMs) may memorize sensitive or copyrighted content, raising privacy and legal concerns. Because retraining from scratch is costly, researchers turn to machine unlearning to remove specific content from LLMs while preserving their overall performance. In this paper, we discuss several issues in machine unlearning for LLMs and offer insights into possible approaches. To address the inadequate evaluation of model outputs after unlearning, we introduce three additional metrics that assess token diversity, sentence semantics, and factual correctness. We then categorize unlearning methods as untargeted or targeted and discuss the issues specific to each. Specifically, the behavior that untargeted unlearning attempts to approximate is unpredictable and may involve hallucinations, while existing regularization is insufficient for targeted unlearning. To alleviate these issues, we propose a maximizing-entropy (ME) objective for untargeted unlearning and incorporate an answer preservation (AP) loss as regularization for targeted unlearning. Experimental results across three scenarios (fictitious unlearning, continual unlearning, and real-world unlearning) demonstrate the effectiveness of our approaches. The code is available at https://github.com/sail-sg/closer-look-LLM-unlearning.
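The maximizing-entropy (ME) objective for untargeted unlearning can be illustrated with a minimal sketch. It assumes the objective is written as the mean KL divergence between each next-token distribution on the forget set and the uniform distribution U over a vocabulary of size V; since KL(p || U) = log V - H(p), minimizing this loss is the same as maximizing the predictive entropy H(p). The function names (`token_entropy`, `me_loss`) and the plain-Python list representation of logits are illustrative assumptions, not code from the linked repository; a real training loop would compute the same quantity on batched tensors inside an autograd framework.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def token_entropy(logits: Sequence[float]) -> float:
    """Shannon entropy (in nats) of a single next-token distribution."""
    probs = softmax(logits)
    # Skip zero probabilities (possible after underflow); 0 * log 0 is taken as 0.
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def me_loss(forget_logits: Sequence[Sequence[float]]) -> float:
    """Hypothetical ME loss: mean KL(p || uniform) over forget-set token positions.

    Each row holds the logits at one token position of a forget-set answer.
    KL(p || uniform) = log(V) - H(p), so minimizing this value pushes every
    predicted distribution toward uniform, i.e. maximizes its entropy.
    """
    vocab_size = len(forget_logits[0])
    kls = [math.log(vocab_size) - token_entropy(row) for row in forget_logits]
    return sum(kls) / len(kls)


if __name__ == "__main__":
    flat = [[1.0, 1.0, 1.0, 1.0]]     # already uniform: loss near 0
    peaked = [[10.0, 0.0, 0.0, 0.0]]  # confident prediction: loss near log(4)
    print(me_loss(flat))
    print(me_loss(peaked) > me_loss(flat))
```

A uniform prediction yields a loss near zero, while a sharply peaked prediction (the model still strongly favors one token) yields a loss close to log V. Under this reading, the term is applied only to forget-set positions; the sketch omits any regularization term that would protect performance on retained data.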

Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, Min Lin (2024)

Related benchmarks

Task	Dataset	Metric	Result
Multi-task Language Understanding	MMLU	--	--	881
Multi-task Language Understanding	MMLU	MMLU Accuracy	54.03	442
Knowledge	MMLU	Accuracy	47.1	161
Multi-task Language Understanding	MMLU (test)	Normalized Accuracy	60.8	87
Language Understanding	MMLU	MMLU Score	60.8	70
Language Model Unlearning	TOFU Forget10	Forget Quality (FQ)	78.6	54
Machine Unlearning	TOFU forget05 1.0	Model Utility (MU)	75.14	53
Machine Unlearning	TOFU 1.0 (forget01)	Average Score	75.99	53
Knowledge Evaluation	Natural Questions (NQ) (Evaluation)	Accuracy	5.7	45
Machine Unlearning	RWKU Llama 3.1 8B (Forget Set)	FB Score	64.4	39
