
Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

About

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.
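The energy-based refusal step described above can be illustrated with a minimal sketch. It is not the authors' EUA implementation. It assumes the energy index resembles the free-energy score common in energy-based out-of-distribution detection, $E = -T \log \sum_i \exp(z_i / T)$ over the logits $z$, and that unlearned inputs end up with higher energy than retained ones. The function names, the threshold value, and the refusal string are all hypothetical.

```python
import math
from typing import Sequence

# Hypothetical refusal message; the source does not specify one.
REFUSAL = "I'm sorry, I can't help with that."


def free_energy(logits: Sequence[float], temperature: float = 1.0) -> float:
    """Return -T * logsumexp(logits / T), computed in a numerically stable way.

    Assumption: this stands in for the paper's energy index, whose exact
    definition is not given in the abstract.
    """
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    log_sum_exp = peak + math.log(sum(math.exp(s - peak) for s in scaled))
    return -temperature * log_sum_exp


def mean_sequence_energy(step_logits: Sequence[Sequence[float]]) -> float:
    """Average the per-token energy over all decoding steps."""
    return sum(free_energy(step) for step in step_logits) / len(step_logits)


def respond_or_refuse(
    step_logits: Sequence[Sequence[float]],
    threshold: float,
    answer: str,
) -> str:
    """Refuse when the mean energy exceeds the (assumed) energy boundary."""
    if mean_sequence_energy(step_logits) > threshold:
        return REFUSAL
    return answer


if __name__ == "__main__":
    confident = [[10.0, 0.0, 0.0]]  # peaked logits give low energy
    uncertain = [[0.0, 0.0, 0.0]]   # flat logits give higher energy
    print(respond_or_refuse(confident, threshold=-5.0, answer="ok"))
    print(respond_or_refuse(uncertain, threshold=-5.0, answer="ok"))
```

In this toy setup the threshold plays the role of the energy boundary: training would push unlearned content above it and keep retained content below it, so the check at inference time reduces to a single comparison.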

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Machine Unlearning | TOFU Forget 10% | Aggregation Score 8.665 | 81 |
| Model Unlearning | TOFU Forget 5% 1.0 | Model Utility 8.519 | 60 |
| Machine Unlearning | MUSE NEWS | VerbMem (Df) 1.759 | 34 |
| Machine Unlearning | WMDP | Accuracy (Bio) 25.67 | 8 |
| Machine Unlearning | MUSE Books | VerbMem 0.00e+0 | 8 |
| LLM Unlearning | TOFU Forget 5% | RQ 7.448 | 5 |
| LLM Unlearning | TOFU Forget 10% | RQ 7.433 | 5 |
