Improving LLM Unlearning Robustness via Random Perturbations
About
Here, we show that current LLM unlearning methods inherently reduce models' robustness, causing them to misbehave even when a single non-adversarial forget-token is present in the retain-query. To understand the underlying causes, we propose a novel theoretical framework that reframes the unlearning process as a backdoor attack and defense problem: we formulate how the forgetting process inadvertently learns to align forget-tokens (backdoor triggers) with the target-representations (target labels). As a result, forget-tokens act as backdoor triggers that, when activated in retain-queries, disrupt the unlearned model's behavior, much as a successful backdoor attack would. In this sense, LLM unlearning methods themselves poison the model: they make it more vulnerable to forget-tokens and hide target knowledge rather than erasing it. To mitigate the vulnerability caused by the forgetting process, we reinterpret the retaining process as a backdoor defense and propose Random Noise Augmentation (RNA), a lightweight, model- and method-agnostic approach with theoretical guarantees for improving the robustness of unlearned models. Extensive experiments demonstrate that RNA significantly improves the robustness of unlearned models while preserving forget and retain performance. This backdoor attack-defense framework offers insights into the mechanism of unlearning that can inform future research on improving unlearning robustness.
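The abstract does not say where RNA injects its noise or how the noise is scaled, so the following is only a rough sketch of the general idea of noise augmentation in a retain objective. It assumes, purely for illustration, that zero-mean Gaussian noise is added to the retain-query representations of the model being updated before they are compared with those of a frozen reference model. The names `add_random_noise`, `retain_loss`, and `scale` are hypothetical and are not taken from the paper.

```python
import numpy as np


def add_random_noise(hidden, scale, rng):
    """Return `hidden` plus i.i.d. zero-mean Gaussian noise with std `scale`.

    Assumption: the noise is Gaussian and applied element-wise; the
    abstract only says that the augmentation is random.
    """
    noise = rng.normal(loc=0.0, scale=scale, size=hidden.shape)
    return hidden + noise


def retain_loss(updated_hidden, frozen_hidden, scale=0.0, rng=None):
    """Mean squared distance between retain-query representations.

    `updated_hidden` stands for the representations of the model being
    unlearned, `frozen_hidden` for those of a frozen reference model.
    When `scale` > 0, the updated representations are perturbed first
    (hypothetical placement of the noise).
    """
    if scale > 0.0:
        rng = rng if rng is not None else np.random.default_rng()
        updated_hidden = add_random_noise(updated_hidden, scale, rng)
    return float(np.mean((updated_hidden - frozen_hidden) ** 2))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy stand-in for a batch of retain-query representations.
    hidden = rng.normal(size=(2, 8))
    print("loss without noise:", retain_loss(hidden, hidden))
    print("loss with noise:   ", retain_loss(hidden, hidden, scale=0.1, rng=rng))
```

In this toy setup, the noised loss is nonzero even when the two models agree exactly, so the retain term keeps exerting pressure around each retain representation instead of at a single point. Whether this matches the paper's formulation would need to be checked against the full text.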
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| General Knowledge Evaluation | MMLU | MMLU Accuracy | 60.3 | 45 |
| Question Answering | WMDP Cyber QA | Default Accuracy | 29.1 | 38 |
| Question Answering | WMDP Biology | Default Score | 33.3 | 38 |
| Language Model Robustness under Adversarial Attack | MMLU Adversarially Perturbed | AuA (GCG) | 43.5 | 19 |
| General Knowledge Evaluation | MMLU Perturbed | Accuracy | 53.5 | 8 |
| General Knowledge Evaluation | MMLU College Biology | Accuracy | 60.4 | 8 |
| General Knowledge Evaluation | MMLU Computer Security | Accuracy | 46 | 8 |
| Knowledge Unlearning Evaluation | WMDP | Score | 34.6 | 8 |