Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

About

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.

Puning Yang, Junchi Yu, Qizhou Wang, Philip Torr, Bo Han, Xiuying Chen • 2026
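
The abstract does not define the energy index, so the following is only a minimal sketch of what an energy-based refusal check could look like. It assumes the free-energy score commonly used in energy-based out-of-distribution detection, E = -T * log(sum(exp(z / T))) over the next-token logits z, and it assumes that unlearned content ends up at higher energy than retained content. The function names, the threshold, and the refusal string are hypothetical and are not taken from the EUA code.

```python
import math


def free_energy(logits, temperature=1.0):
    """Assumed energy definition: -T * logsumexp(logits / T)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    lse = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * lse


def mean_energy(step_logits, temperature=1.0):
    """Average the per-token energy over a sequence of logit vectors."""
    return sum(free_energy(l, temperature) for l in step_logits) / len(step_logits)


def respond_or_refuse(step_logits, answer, threshold,
                      refusal="I cannot help with that."):
    """Return the answer only when the mean energy is at or below the threshold.

    Assumption: low energy (confident, peaked logits) indicates retained
    knowledge; high energy indicates unlearned content.
    """
    return answer if mean_energy(step_logits) <= threshold else refusal


# Hypothetical logits: one confident step and one flat (uninformative) step.
confident = [[10.0, 0.0, 0.0]]
flat = [[0.0, 0.0, 0.0]]
print(respond_or_refuse(confident, "answer text", threshold=-5.0))
print(respond_or_refuse(flat, "answer text", threshold=-5.0))
```

In practice the threshold would have to be calibrated on held-out retained and unlearned prompts; the value used here is arbitrary.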

Related benchmarks

Task	Dataset	Metric	Result
Machine Unlearning	TOFU Forget 10%	Aggregation Score	8.665	81
Model Unlearning	TOFU Forget 5% 1.0	Model Utility	8.519	60
Machine Unlearning	MUSE NEWS	VerbMem (Df)	1.759	34
Machine Unlearning	WMDP	Accuracy (Bio)	25.67	8
Machine Unlearning	MUSE Books	VerbMem	0.00e+0	8
LLM Unlearning	TOFU Forget 5%	RQ	7.448	5
LLM Unlearning	TOFU Forget 10%	RQ	7.433	5
