SAEmnesia: Erasing Concepts in Diffusion Models with Supervised Sparse Autoencoders

About

Concept unlearning in diffusion models is hampered by feature splitting, where concepts are distributed across many latent features, making their removal challenging and computationally expensive. We introduce SAEmnesia, a supervised sparse autoencoder framework that overcomes this by enforcing one-to-one concept-neuron mappings. By systematically labeling concepts during training, our method achieves feature centralization, binding each concept to a single, interpretable neuron. This enables highly targeted and efficient concept erasure. Compared to the state-of-the-art sparse autoencoder-based unlearning approach, SAEmnesia reduces hyperparameter search by 96.67% and achieves a 9.22% improvement on the UnlearnCanvas benchmark for objects. Our method also shows superior scalability in sequential unlearning, improving accuracy by 28.4% when removing nine objects, establishing a step forward for precise and controllable concept erasure. Moreover, SAEmnesia effectively suppresses nudity on the I2P benchmark and remains robust to adversarial attacks. Source code available at https://github.com/EIDOSLAB/SAEmnesia.
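The idea of binding each concept to a single neuron and then erasing it can be illustrated with a toy example. The sketch below is not the authors' implementation. It assumes a ReLU encoder, a linear decoder, a cross-entropy term over the first `num_concepts` latents as a stand-in for the supervised objective, and erasure by rescaling one latent before decoding. All function names, weights, and the `scale` parameter are hypothetical.

```python
import math

def encode(x, W_enc, b_enc):
    """Latent activations z = ReLU(W_enc x + b_enc); one row of W_enc per latent."""
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    return [max(0.0, p) for p in pre]

def decode(z, W_dec, b_dec):
    """Linear reconstruction x_hat = W_dec z + b_dec; one row of W_dec per output dim."""
    return [sum(w * zi for w, zi in zip(row, z)) + b for row, b in zip(W_dec, b_dec)]

def concept_loss(z, concept_idx, num_concepts):
    """Assumed supervised term: cross-entropy that treats the first
    num_concepts latents as logits and the labeled concept as the target.
    Minimizing it pushes the labeled concept's own latent to dominate."""
    logits = z[:num_concepts]
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum_exp - logits[concept_idx]

def erase_concept(z, concept_idx, scale=0.0):
    """Assumed erasure step: rescale the single latent bound to the concept."""
    out = list(z)
    out[concept_idx] *= scale
    return out

# Toy setup (hypothetical values): 3 latents, the first 2 reserved for concepts.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
zeros = [0.0, 0.0, 0.0]

x = [1.0, 0.0, 0.5]                       # activation dominated by concept 0
z = encode(x, identity, zeros)
loss_correct = concept_loss(z, 0, 2)      # label matches the active latent
loss_wrong = concept_loss(z, 1, 2)        # label points at the inactive latent
z_erased = erase_concept(z, 0)            # switch off the neuron for concept 0
x_erased = decode(z_erased, identity, zeros)
```

Because the concept lives in one latent in this toy setup, erasure touches a single index and leaves the remaining activations unchanged, which is the property the abstract attributes to feature centralization.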

Enrico Cassano, Riccardo Renzulli, Marco Nurisso, Mirko Zaffaroni, Alan Perotti, Marco Grangetto • 2025

Related benchmarks

Task	Dataset	Metric	Result
Style Unlearning	UnlearnCanvas	UA	0.966	36
Nudity Unlearning	I2P	Armpits Count	7	33
Object Unlearning	UnlearnCanvas	Unlearning Accuracy (UA)	97.6	31
Concept Unlearning	UnlearnCanvas	Total Avg. Acc	94.85	22
Concept Unlearning	UnlearnCanvas object concept unlearning	Unlearning Accuracy	94.65	11
Object concept unlearning	UnlearnCanvas (IRA images)	FID	40.82	8
Object Unlearning	UnlearnCanvas objects	Object UA (Before)	97.6	4

