
Toward Reliable Machine Unlearning: Theory, Algorithms, and Evaluation

About

We propose new methodologies for both unlearning a random set of samples and class unlearning, and show that they outperform existing methods. The main driver of our unlearning methods is the similarity of the unlearned model's predictions to those of a retrained model on both the forget and remain samples.

For random-sample unlearning, we introduce Adversarial Machine UNlearning (AMUN), which surpasses prior state-of-the-art methods for image classification, as measured by state-of-the-art membership inference attack (MIA) scores. AMUN lowers the model's confidence on forget samples by fine-tuning on their corresponding adversarial examples. Through theoretical analysis, we identify factors governing AMUN's performance, including the smoothness of the model. To facilitate training of smooth models with a controlled Lipschitz constant, we propose FastClip, a scalable method that performs layer-wise spectral-norm clipping of affine layers. In a separate study, we show that increased smoothness naturally improves the transferability of adversarial examples, thereby supporting the second factor above.

Following the same principles for class unlearning, we show that existing methods fail to replicate a retrained model's behavior: we introduce a nearest-neighbor membership inference attack (MIA-NN) that uses the probabilities assigned to neighboring classes to detect unlearned samples, and demonstrate the vulnerability of such methods to it. We then propose a fine-tuning objective that mitigates this leakage by approximating, for forget-class inputs, the distribution over the remaining classes that a model retrained from scratch would produce. To construct this approximation, we estimate inter-class similarity and tilt the target model's distribution accordingly. The resulting Tilted ReWeighting (TRW) distribution serves as the desired target during fine-tuning. Across multiple benchmarks, TRW matches or surpasses existing unlearning methods on prior metrics.
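To make the TRW construction concrete, here is a minimal sketch of one plausible way to tilt a model's output distribution toward classes similar to the forget class. The exponential-tilt form, the `beta` hyperparameter, and the way similarity is supplied are assumptions for illustration; the description above only states that inter-class similarity is estimated and the distribution tilted accordingly.

```python
import numpy as np

def tilted_reweighting(probs, sim_to_forget, forget_class, beta=1.0):
    """Sketch of a TRW-style target distribution (formula is an assumption).

    probs          : model's softmax output over all C classes, shape (C,)
    sim_to_forget  : estimated similarity of each class to the forget class, shape (C,)
    forget_class   : index of the class being unlearned
    beta           : hypothetical tilt-strength hyperparameter

    Returns a distribution supported on the remaining classes only: mass on
    the forget class is removed, and the remaining classes are up-weighted
    in proportion to their estimated similarity to the forget class, which
    mimics where a retrained model tends to place forget-class inputs.
    """
    tilted = probs * np.exp(beta * sim_to_forget)  # exponential tilt by similarity
    tilted[forget_class] = 0.0                     # no mass on the forget class
    return tilted / tilted.sum()                   # renormalize over remaining classes

# Example: forget class 0; class 1 is estimated to be much more similar
# to it than class 2, so class 1 receives most of the redistributed mass.
probs = np.array([0.7, 0.2, 0.1])
sim = np.array([1.0, 0.9, 0.1])
target = tilted_reweighting(probs, sim, forget_class=0, beta=2.0)
```

The resulting `target` would then serve as the soft label for forget-class inputs in the fine-tuning objective, e.g. via a cross-entropy or KL term against the model's predictions.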

Ali Ebrahimpour-Boroojeny • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Class Unlearning | CIFAR-10 | Retain Accuracy: 94.39 | 39 |
| Single-class Unlearning | CIFAR-100 | ACCr: 78.05 | 28 |
| Single-class Unlearning | MNIST | Accuracy Retention (ACCr): 0.9952 | 28 |
| Class Unlearning | Tiny ImageNet (test) | Df (Degree of Forgetting): 0.00e+0 | 19 |
| Machine Unlearning | CIFAR-10 Random Forget (10%) | Unlearn Accuracy: 95.45 | 16 |
| Machine Unlearning | CIFAR-10 Random Forget (50%) | Unlearn Accuracy: 93.56 | 16 |
| Class Unlearning | CIFAR-10 | U-LiRA Accuracy: 71.12 | 12 |
| Unlearning | CIFAR-10 Random Forget (10%) | Unlearn Accuracy: 95.45 | 9 |
| Unlearning | CIFAR-10 Random Forget (50%) | Unlearn Accuracy: 93.56 | 9 |
| Machine Unlearning | CIFAR-10 Random Forget 10% (train) | Unlearn Accuracy: 94.28 | 7 |

Showing 10 of 11 rows
