
Improving LLM Unlearning Robustness via Random Perturbations

About

Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token appears in a retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt the unlearned model's behavior, much like a successful backdoor attack. In this sense, LLM unlearning methods themselves poison the model, making it more vulnerable to forget-tokens, and hide rather than erase the target knowledge. To mitigate the vulnerability introduced by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack-defense framework offers insights into the mechanism of unlearning and can shed light on future research directions for improving unlearning robustness.

Dang Huu-Tien, Hoang Thanh-Tung, Anh Bui, Minh-Phuong Nguyen, Le-Minh Nguyen, Naoya Inoue • 2025
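The abstract does not spell out RNA's mechanics, but the name suggests perturbing model inputs with random noise during the retaining phase so that representations stay stable under small perturbations. A minimal, hypothetical sketch of that idea (the function name, `sigma` parameter, and use of embedding-level Gaussian noise are assumptions, not the paper's stated implementation):

```python
import numpy as np

def random_noise_augmentation(embeddings, sigma=0.1, rng=None):
    """Hypothetical sketch of RNA: add zero-mean Gaussian noise to
    retain-query embeddings before computing the retain loss, so the
    model is trained to behave consistently under small perturbations
    (e.g., a stray forget-token acting as a backdoor trigger)."""
    rng = np.random.default_rng(rng)
    noise = rng.normal(0.0, sigma, size=embeddings.shape)
    return embeddings + noise

# Example: perturb a batch of 4 token embeddings of dimension 8.
emb = np.zeros((4, 8))
aug = random_noise_augmentation(emb, sigma=0.05, rng=0)
```

In such a scheme, the retain loss would be computed on `aug` instead of `emb`; the noise scale `sigma` would trade off robustness against retain performance.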

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Knowledge Evaluation | MMLU | MMLU Accuracy | 60.3 | 45 |
| Question Answering | WMDP Cyber QA | Default Accuracy | 29.1 | 38 |
| Question Answering | WMDP Biology | Default Score | 33.3 | 38 |
| Language Model Robustness under Adversarial Attack | MMLU Adversarially Perturbed | AuA (GCG) | 43.5 | 19 |
| General Knowledge Evaluation | MMLU Perturbed | Accuracy | 53.5 | 8 |
| General Knowledge Evaluation | MMLU Corporate Biology | Accuracy | 60.4 | 8 |
| General Knowledge Evaluation | MMLU Computer Security | Accuracy | 46 | 8 |
| Knowledge Unlearning Evaluation | WMDP | Score | 34.6 | 8 |
