
Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation

About

We propose new methodologies for both unlearning a random set of samples and class unlearning, and show that they outperform existing methods. The main driver of our unlearning methods is the similarity of the unlearned model's predictions to those of a retrained model on both the forget and remain samples.

For random-sample unlearning, we introduce Adversarial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification, as measured by state-of-the-art membership inference attack (MIA) scores. AMUN lowers the model's confidence on forget samples by fine-tuning on their corresponding adversarial examples. Through theoretical analysis, we identify factors governing AMUN's performance, including the smoothness of the model. To facilitate training of smooth models with a controlled Lipschitz constant, we propose FastClip, a scalable method that performs layer-wise spectral-norm clipping of affine layers. In a separate study, we show that increased smoothness naturally improves the transferability of adversarial examples, thereby supporting the second factor above.

Following the same principles for class unlearning, we show that existing methods fail to replicate a retrained model's behavior: we introduce a nearest-neighbor membership inference attack (MIA-NN) that uses the probabilities assigned to neighboring classes to detect unlearned samples, and demonstrate the vulnerability of such methods to it. We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a model retrained from scratch would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired target during fine-tuning. Across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior metrics.
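To make the TRW construction concrete, here is a minimal sketch of one plausible way to tilt a model's output distribution toward classes similar to the forget class. The exponential-tilt form, the `beta` hyperparameter, and the way similarity is supplied are assumptions for illustration; the description above only states that inter-class similarity is estimated and the distribution tilted accordingly.

```python
import numpy as np

def tilted_reweighting(probs, sim_to_forget, forget_class, beta=1.0):
    """Sketch of a TRW-style target distribution (formula is an assumption).

    probs          : model's softmax output over all C classes, shape (C,)
    sim_to_forget  : estimated similarity of each class to the forget class, shape (C,)
    forget_class   : index of the class being unlearned
    beta           : hypothetical tilt-strength hyperparameter

    Returns a distribution supported on the remaining classes only: mass on
    the forget class is removed, and the remaining classes are up-weighted
    in proportion to their estimated similarity to the forget class, which
    mimics where a retrained model tends to place forget-class inputs.
    """
    tilted = probs * np.exp(beta * sim_to_forget)  # exponential tilt by similarity
    tilted[forget_class] = 0.0                     # no mass on the forget class
    return tilted / tilted.sum()                   # renormalize over remaining classes

# Example: forget class 0; class 1 is estimated to be much more similar
# to it than class 2, so class 1 receives most of the redistributed mass.
probs = np.array([0.7, 0.2, 0.1])
sim = np.array([1.0, 0.9, 0.1])
target = tilted_reweighting(probs, sim, forget_class=0, beta=2.0)
```

The resulting `target` would then serve as the soft label for forget-class inputs in the fine-tuning objective, e.g. via a cross-entropy or KL term against the model's predictions.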

Ali Ebrahimpour-Boroojeny • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Class Unlearning | CIFAR-10 | Retain Accuracy: 94.39 | 39 |
| Single-class Unlearning | CIFAR-100 | ACCr: 78.05 | 28 |
| Single-class Unlearning | MNIST | Accuracy Retention (ACCr): 0.9952 | 28 |
| Class Unlearning | Tiny ImageNet (test) | Df (Degree of Forgetting): 0.00e+0 | 19 |
| Machine Unlearning | CIFAR-10 Random Forget (10%) | Unlearn Accuracy: 95.45 | 16 |
| Machine Unlearning | CIFAR-10 Random Forget (50%) | Unlearn Accuracy: 93.56 | 16 |
| Class Unlearning | CIFAR-10 | U-LiRA Accuracy: 71.12 | 12 |
| Unlearning | CIFAR-10 Random Forget (10%) | Unlearn Accuracy: 95.45 | 9 |
| Unlearning | CIFAR-10 Random Forget (50%) | Unlearn Accuracy: 93.56 | 9 |
| Machine Unlearning | CIFAR-10 Random Forget 10% (train) | Unlearn Accuracy: 94.28 | 7 |

Showing 10 of 11 rows
