Geometric-disentanglement Unlearning

About

Large language models (LLMs) can internalize private or harmful content. This motivates unlearning: removing a forget set while preserving retained knowledge. However, forgetting updates often cause collateral degradation of retained knowledge, which creates a persistent forget-retain trade-off. Existing LLM unlearning methods are often heuristic. Other, theoretical approaches rely on offline feature constructions that do not capture how forget and retain objectives interact at update time in LLMs. To address this limitation, we aim to develop an LLM unlearning method that reduces the forget-retain trade-off with theoretical guarantees. We take a first-principles view, formalizing "no side effects" as local retain invariance under small parameter updates. We then prove an equivalence under optimizer-induced geometry: the retain loss is locally invariant if and only if the update direction is orthogonal to the subspace spanned by retain gradients. Based on this insight, we propose Geometric-disentanglement Unlearning (GU), a lightweight, theoretically grounded projection. GU can be plugged into existing gradient-based unlearning methods to mitigate forget-retain side effects. Experiments on TOFU, MUSE, and WMDP-cyber show that GU strengthens forgetting while reducing retain drift. When added to SimNPO, it achieves up to a 62% improvement in forgetting Extraction Strength (ES) and 31% higher retain ES. Our code is open-sourced at https://github.com/Lemutisme/Geometric-Unlearning.
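The orthogonality condition can be illustrated with a first-order argument. For a small update Δθ, the retain loss changes by approximately ∇L_retain · Δθ. Suppose the optimizer applies Δθ = −η P d, where d is the forget direction and P is a positive preconditioner. The change then becomes −η ⟨g_r, d⟩_P. It vanishes exactly when d is P-orthogonal to every retain gradient g_r.

The following is a minimal sketch of that projection. Several assumptions apply:
- Gradients are flattened into plain Python lists.
- "Optimizer-induced geometry" is approximated by a hypothetical diagonal preconditioner `p` (for example, Adam-style per-parameter scaling). When `p` is omitted, the geometry is Euclidean, as in plain SGD.
- The helper names (`inner`, `orthonormal_basis`, `project_out_retain`, `projected_step`) are hypothetical.

This is not the authors' implementation, and the paper's exact construction may differ. The linked repository is the reference.

```python
from typing import List, Optional

Vector = List[float]


def inner(u: Vector, v: Vector, p: Optional[Vector] = None) -> float:
    """<u, v>_P = sum_i u_i * p_i * v_i; plain dot product when p is None.

    `p` is a hypothetical diagonal preconditioner standing in for the
    optimizer-induced geometry (an assumption, not the paper's definition).
    """
    if p is None:
        return sum(a * b for a, b in zip(u, v))
    return sum(a * w * b for a, w, b in zip(u, p, v))


def orthonormal_basis(grads: List[Vector], p: Optional[Vector] = None,
                      eps: float = 1e-10) -> List[Vector]:
    """Modified Gram-Schmidt under <., .>_P over the retain gradients.

    Gradients that are (numerically) linearly dependent on earlier ones are
    dropped, so the basis spans exactly the retain-gradient subspace.
    """
    basis: List[Vector] = []
    for g in grads:
        v = list(g)
        for q in basis:
            c = inner(v, q, p)
            v = [vi - c * qi for vi, qi in zip(v, q)]
        norm = inner(v, v, p) ** 0.5
        if norm > eps:
            basis.append([vi / norm for vi in v])
    return basis


def project_out_retain(forget_dir: Vector, retain_grads: List[Vector],
                       p: Optional[Vector] = None) -> Vector:
    """Remove the component of forget_dir lying in span(retain_grads) under <., .>_P."""
    d = list(forget_dir)
    for q in orthonormal_basis(retain_grads, p):
        c = inner(d, q, p)
        d = [di - c * qi for di, qi in zip(d, q)]
    return d


def projected_step(theta: Vector, forget_dir: Vector, retain_grads: List[Vector],
                   lr: float, p: Optional[Vector] = None) -> Vector:
    """Hypothetical update theta <- theta - lr * P * d_perp.

    To first order, the retain loss changes by -lr * <g_r, d_perp>_P = 0
    for every retain gradient g_r.
    """
    d = project_out_retain(forget_dir, retain_grads, p)
    scale = p if p is not None else [1.0] * len(theta)
    return [t - lr * w * di for t, w, di in zip(theta, scale, d)]
```

As a usage example, with `retain_grads = [[1.0, 0.0, 0.0], [0.0, 1.0, 1.0]]`, `forget_dir = [0.5, 2.0, -1.0]`, and `p = [1.0, 0.5, 2.0]`, `project_out_retain` returns approximately `[0.0, 2.4, -0.6]`. This vector has zero P-inner product with both retain gradients. Because the projection only subtracts components, it can wrap any gradient-based forget objective without changing that objective. At LLM scale, one would batch these operations on tensors rather than use Python lists.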

Duo Zhou, Yuji Zhang, Tianxin Wei, Ruizhong Qiu, Ke Yang, Xiao Lin, Cheng Qian, Jingrui He, Hanghang Tong, Chengxiang Zhai, Heng Ji, Huan Zhang• 2025

Related benchmarks

Task	Dataset	Setting	Metric	Result	#
Machine Unlearning	CIFAR-10	Random forgetting	Retention Accuracy (RA)	98.6	11
Machine Unlearning	CIFAR-100	Random forgetting	Retention Accuracy (RA)	96.48	11
Machine Unlearning	CIFAR-10	Class-wise forgetting	Retention Accuracy (RA)	94.2	11
Machine Unlearning	CIFAR-100	Class-wise forgetting	Retention Accuracy (RA)	90.34	11
Machine Unlearning	Tiny-ImageNet	Random forgetting	Retention Accuracy (RA)	99.98	10
Machine Unlearning	Tiny-ImageNet	Class-wise forgetting	Retention Accuracy (RA)	99.54	10
