SAeUron: Interpretable Concept Unlearning in Diffusion Models with Sparse Autoencoders

About

Diffusion models, while powerful, can inadvertently generate harmful or undesirable content, raising significant ethical and safety concerns. Recent machine unlearning approaches offer potential solutions but often lack transparency, making it difficult to understand the changes they introduce to the base model. In this work, we introduce SAeUron, a novel method leveraging features learned by sparse autoencoders (SAEs) to remove unwanted concepts in text-to-image diffusion models. First, we demonstrate that SAEs, trained in an unsupervised manner on activations from multiple denoising timesteps of the diffusion model, capture sparse and interpretable features corresponding to specific concepts. Building on this, we propose a feature selection method that enables precise interventions on model activations to block targeted content while preserving overall performance. Our evaluation shows that SAeUron outperforms existing approaches on the UnlearnCanvas benchmark for concept and style unlearning, and effectively eliminates nudity when evaluated with I2P. Moreover, we show that a single SAE can remove multiple concepts simultaneously and that, in contrast to other methods, SAeUron mitigates the possibility of generating unwanted content under adversarial attack. Code and checkpoints are available at https://github.com/cywinski/SAeUron.

Bartosz Cywiński, Kamil Deja • 2025
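
The intervention described in the abstract (encode an activation with an SAE, suppress the features tied to an unwanted concept, decode, and write the result back) can be illustrated with a small sketch. This is not the authors' implementation: the class `TinySAE`, the function `ablate`, the top-k sparsity rule, the random untrained weights, and the default `scale=0.0` are all hypothetical choices made for illustration, since the abstract does not specify the SAE architecture or the exact intervention rule. The sketch also assumes that the SAE's reconstruction error is added back, so that an activation whose selected features are inactive passes through unchanged.

```python
import random


class TinySAE:
    """Hypothetical single-layer sparse autoencoder with a top-k activation.

    Weights are random and untrained; a real SAE would be fitted on
    activations collected from the diffusion model.
    """

    def __init__(self, d_model, n_features, k, seed=0):
        rng = random.Random(seed)
        self.k = k
        # Encoder: n_features rows, each of length d_model.
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_model)]
                      for _ in range(n_features)]
        # Decoder: d_model rows, each of length n_features.
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(n_features)]
                      for _ in range(d_model)]

    def encode(self, x):
        # ReLU pre-activations, then keep only the k largest (top-k sparsity).
        pre = [max(0.0, sum(w * xi for w, xi in zip(row, x)))
               for row in self.w_enc]
        order = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)
        keep = set(order[:self.k])
        return [v if i in keep else 0.0 for i, v in enumerate(pre)]

    def decode(self, z):
        return [sum(w * zi for w, zi in zip(row, z)) for row in self.w_dec]


def ablate(sae, x, feature_ids, scale=0.0):
    """Rescale the selected SAE features and write the change back into x.

    Only the difference between the edited and unedited reconstructions is
    applied, so the part of x that the SAE fails to reconstruct is preserved
    (an assumption of this sketch).
    """
    z = sae.encode(x)
    z_edit = list(z)
    for i in feature_ids:
        z_edit[i] *= scale
    recon = sae.decode(z)
    recon_edit = sae.decode(z_edit)
    return [xi + (e - r) for xi, e, r in zip(x, recon_edit, recon)]
```

In this sketch, `feature_ids` stands in for the output of a feature selection step, and `scale` controls how strongly the selected features are suppressed; both would be chosen per concept in practice.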

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	MS-COCO	FID 80.42	193
Nudity Erasure	I2P	--	52
Explicit Content Removal	I2P	Buttocks Count 3	47
Concept Unlearning	UnlearnDiffAtk	UnlearnDiffAtk 0.197	36
Style Unlearning	UnlearnCanvas	UA 0.958	36
Nudity Unlearning	I2P	Armpits Count 7	33
Object Unlearning	UnlearnCanvas	Unlearning Accuracy (UA) 87.16	31
Safety Generalization	I2P (test)	Self-Harm Score 85.31	24
Concept Unlearning	UnlearnCanvas	Total Avg. Acc 90.1	22
Text-to-Image Generation	VSA	ASR 70	21

Showing 10 of 25 rows
