Localized Concept Erasure in Text-to-Image Diffusion Models via High-Level Representation Misdirection

About

Text-to-image (T2I) diffusion models have advanced rapidly and seen widespread adoption. However, their powerful generative capabilities raise concerns about potential misuse for synthesizing harmful, private, or copyrighted content. To mitigate such risks, concept erasure techniques have emerged as a promising solution. Prior works have primarily focused on fine-tuning the denoising component (e.g., the U-Net backbone). Recent causal tracing studies, however, suggest that visual attribute information is localized in the early self-attention layers of the text encoder, indicating a potential alternative for concept erasure. Building on this insight, we conduct preliminary experiments and find that directly fine-tuning early layers can suppress target concepts but often degrades the generation quality of non-target concepts. To overcome this limitation, we propose High-Level Representation Misdirection (HiRM), which misdirects high-level semantic representations of target concepts in the text encoder toward designated vectors such as random directions or semantically defined directions (e.g., supercategories), while updating only early layers that contain causal states of visual attributes. Our decoupling strategy enables precise concept removal with minimal impact on unrelated concepts, as demonstrated by strong results on UnlearnCanvas and NSFW benchmarks across diverse targets (e.g., objects, styles, nudity). HiRM also preserves generative utility at low training cost, transfers to state-of-the-art architectures such as Flux without additional training, and shows synergistic effects with denoiser-based concept erasure methods.
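The core idea in the abstract (push the high-level representation of a target concept toward a designated vector while updating only early layers) can be illustrated with a toy sketch. This is not the authors' implementation: the two linear "layers", the squared-error loss, the `scale` factor, the step-size rule, and all function names are assumptions made purely for illustration. A real text encoder is a deep transformer, and the abstract does not specify the loss.

```python
import math
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def misdirection_step(W_early, W_late, x, target, lr):
    """One gradient step on ||W_late @ W_early @ x - target||^2.

    Only W_early is modified (in place); W_late stays frozen.
    Returns the loss measured before the update.
    """
    h = matvec(W_early, x)  # early-layer state (toy stand-in)
    z = matvec(W_late, h)   # high-level representation (toy stand-in)
    residual = [zi - ti for zi, ti in zip(z, target)]
    # Backpropagate through the frozen late layer: dL/dh = 2 * W_late^T @ residual
    grad_h = [
        2.0 * sum(W_late[k][j] * residual[k] for k in range(len(W_late)))
        for j in range(len(h))
    ]
    # dL/dW_early[j][i] = grad_h[j] * x[i]
    for j, row in enumerate(W_early):
        for i in range(len(row)):
            row[i] -= lr * grad_h[j] * x[i]
    return sum(r * r for r in residual)


def run_demo(dim=4, steps=100, scale=3.0, seed=0):
    """Misdirect one toy 'concept' input toward a scaled random direction."""
    rng = random.Random(seed)
    W_early = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
    W_late = [[rng.gauss(0, 0.5) for _ in range(dim)] for _ in range(dim)]
    W_late_before = [row[:] for row in W_late]
    x = [rng.gauss(0, 1) for _ in range(dim)]  # hypothetical target-concept input
    u = [rng.gauss(0, 1) for _ in range(dim)]  # random direction
    u_norm = math.sqrt(sum(ui * ui for ui in u))
    target = [scale * ui / u_norm for ui in u]  # designated vector (assumed form)
    # Conservative step size so the loss cannot increase in this linear toy model.
    x_sq = sum(xi * xi for xi in x)
    late_sq = sum(w * w for row in W_late for w in row)
    lr = 1.0 / (2.0 * x_sq * late_sq)
    losses = [
        misdirection_step(W_early, W_late, x, target, lr) for _ in range(steps)
    ]
    return losses, W_late == W_late_before


if __name__ == "__main__":
    losses, late_unchanged = run_demo()
    print("initial loss:", losses[0])
    print("final loss:", losses[-1])
    print("late layer unchanged:", late_unchanged)
```

In this toy setting the loss is measured at the output of the frozen late layer while the gradient is applied only to the early weights, which mirrors the decoupling described above. A semantically defined target (e.g., a supercategory representation) could replace the random direction in the same loop.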

Uichan Lee, Jeonghyeon Kim, Sangheum Hwang • 2026

Related benchmarks

Task	Dataset	Result
Concept Unlearning	UnlearnDiffAtk	UnlearnDiffAtk: 0.1901	36
Inappropriate Content Erasing	I2P	I2P (%): 0.96	14
Adversarial Robustness in Concept Erasing	MMA-Diffusion	MMA-Diffusion Score: 8	14
Utility Preservation	COCO	CLIP Score: 0.306	14
Adversarial Robustness in Concept Erasing	Ring-A-Bell K-16, K-38, K-77	K-16 Score: 0.0105	14
Object Erasing	UnlearnCanvas	Unlearning Accuracy (UA): 96.2	13
Safety Evaluation	Ring-a-Bell	Ring-16 Score: 4.76	13
Style Erasing	UnlearnCanvas	UA: 96.2	13
Concept Erasure	Ring-16	Nudity Rate: 37.89	7
Text-to-Image Generation	COCO 1K	CLIP Score: 0.308	7

Showing 10 of 14 rows
