# Teleportation-Based Defenses for Privacy in Approximate Machine Unlearning

## About
Approximate machine unlearning aims to efficiently remove the influence of specific data points from a trained model, offering a practical alternative to full retraining. However, it introduces privacy risks: an adversary with access to pre- and post-unlearning models can exploit their differences for membership inference or data reconstruction. We show these vulnerabilities arise from two factors: large gradient norms of forget-set samples and the close proximity of unlearned parameters to the original model. To demonstrate their severity, we propose unlearning-specific membership inference and reconstruction attacks, showing that several state-of-the-art methods (e.g., NGP, SCRUB) remain vulnerable. To mitigate this leakage, we introduce WARP, a plug-and-play teleportation defense that leverages neural network symmetries to reduce forget-set gradient energy and increase parameter dispersion while preserving predictions. This reparameterization obfuscates the signal of forgotten data, making it harder for attackers to distinguish forgotten samples from non-members or recover them via reconstruction. Across six unlearning algorithms, our approach achieves consistent privacy gains, reducing adversarial advantage (AUC) by up to 64% in black-box and 92% in white-box settings, while maintaining accuracy on retained data. These results highlight teleportation as a general tool for reducing attack success in approximate unlearning.
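The defense builds on parameter-space symmetries: a network can be reparameterized so its weights move far from the original point while its predictions stay exactly the same. The snippet below is a minimal illustrative sketch of one such symmetry (the positive scaling symmetry of ReLU layers), not the paper's WARP algorithm; the network, shapes, and scale factors are hypothetical.

```python
import numpy as np

# Sketch of a "teleportation" symmetry for a two-layer ReLU network.
# Because ReLU is positively homogeneous (relu(a*z) = a*relu(z) for a > 0),
# scaling each hidden neuron's incoming weights by alpha and dividing its
# outgoing weights by alpha leaves the function unchanged, while the
# parameter vector itself moves away from the original model.

def forward(x, W1, b1, W2, b2):
    h = np.maximum(0.0, x @ W1 + b1)  # hidden ReLU activations
    return h @ W2 + b2

rng = np.random.default_rng(0)
W1 = rng.normal(size=(4, 8)); b1 = rng.normal(size=8)
W2 = rng.normal(size=(8, 3)); b2 = rng.normal(size=3)
x = rng.normal(size=(5, 4))

# "Teleport": pick positive per-neuron scales and rebalance the two layers.
alpha = rng.uniform(0.5, 2.0, size=8)
W1t, b1t = W1 * alpha, b1 * alpha   # scale each hidden neuron's inputs
W2t = W2 / alpha[:, None]           # undo the scale on its outputs

y = forward(x, W1, b1, W2, b2)
yt = forward(x, W1t, b1t, W2t, b2)
assert np.allclose(y, yt)           # predictions preserved exactly
```

A defense in this spirit can choose the scales to reduce forget-set gradient norms and push the unlearned parameters away from the original model, which is the leakage signal the attacks in the paper exploit.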
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-10 (test) | Accuracy | 79.7 | 3381 |
| Membership Inference Attack | CIFAR-10 (forget set) | AUC | 66.1 | 12 |
| Black-box Membership Inference Attack | CIFAR-10 (most-memorized 1% forget samples) | AUC | 0.875 | 12 |
| Membership Inference Attack | CIFAR-10 (all forget samples) | AUC | 0.516 | 5 |
| Membership Inference Attack | CIFAR-10 (most-memorized top-5% forget samples) | AUC | 59.8 | 5 |
| Reconstruction Attack | ImageNet-1K (100 forgotten samples) | PSNR (dB) | 10.74 | 2 |
| Data Reconstruction Attack | ImageNet-1K | PSNR (dB) | 7.38 | 2 |
| White-box Membership Inference Attack | Tiny-ImageNet (forget set) | AUC | 0.755 | 2 |