
Improving LLM Unlearning Robustness via Random Perturbations

About

Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token appears in a retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt the unlearned model's behavior, much like a successful backdoor attack. In this sense, LLM unlearning methods themselves poison the model, making it more vulnerable to forget-tokens, and hide rather than erase the target knowledge. To mitigate the vulnerability introduced by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack-defense framework offers insights into the mechanism of unlearning and can shed light on future research directions for improving unlearning robustness.

Dang Huu-Tien, Hoang Thanh-Tung, Anh Bui, Minh-Phuong Nguyen, Le-Minh Nguyen, Naoya Inoue • 2025
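The abstract does not spell out RNA's mechanics, but the name suggests perturbing model inputs with random noise during the retaining phase so that representations stay stable under small perturbations. A minimal, hypothetical sketch of that idea (the function name, `sigma` parameter, and use of embedding-level Gaussian noise are assumptions, not the paper's stated implementation):

```python
import numpy as np

def random_noise_augmentation(embeddings, sigma=0.1, rng=None):
    """Hypothetical sketch of RNA: add zero-mean Gaussian noise to
    retain-query embeddings before computing the retain loss, so the
    model is trained to behave consistently under small perturbations
    (e.g., a stray forget-token acting as a backdoor trigger)."""
    rng = np.random.default_rng(rng)
    noise = rng.normal(0.0, sigma, size=embeddings.shape)
    return embeddings + noise

# Example: perturb a batch of 4 token embeddings of dimension 8.
emb = np.zeros((4, 8))
aug = random_noise_augmentation(emb, sigma=0.05, rng=0)
```

In such a scheme, the retain loss would be computed on `aug` instead of `emb`; the noise scale `sigma` would trade off robustness against retain performance.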

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Knowledge Evaluation | MMLU | MMLU Accuracy | 60.3 | 45 |
| Question Answering | WMDP Cyber QA | Default Accuracy | 29.1 | 38 |
| Question Answering | WMDP Biology | Default Score | 33.3 | 38 |
| Language Model Robustness under Adversarial Attack | MMLU Adversarially Perturbed | AuA (GCG) | 43.5 | 19 |
| General Knowledge Evaluation | MMLU Perturbed | Accuracy | 53.5 | 8 |
| General Knowledge Evaluation | MMLU Corporate Biology | Accuracy | 60.4 | 8 |
| General Knowledge Evaluation | MMLU Computer Security | Accuracy | 46 | 8 |
| Knowledge Unlearning Evaluation | WMDP | Score | 34.6 | 8 |
