
Training-Free Safe Denoisers for Safe Use of Diffusion Models

About

There is growing concern over the safety of powerful diffusion models (DMs), as they are often misused to produce inappropriate, not-safe-for-work (NSFW) content, or to generate copyrighted material or data of individuals who wish to be forgotten. Many existing methods tackle these issues by relying heavily on text-based negative prompts or by extensively retraining DMs to eliminate certain features or samples. In this paper, we take a radically different approach, directly modifying the sampling trajectory by leveraging a negation set (e.g., unsafe images, copyrighted data, or datapoints that need to be excluded) to avoid specific regions of the data distribution, without needing to retrain or fine-tune DMs. We formally derive the relationship between the expected denoised samples that are safe and those that are not, leading to our $\textit{safe}$ denoiser, which ensures its final samples stay away from the region to be negated. Building on this derivation, we develop a practical algorithm that produces high-quality samples while avoiding negation regions of the data distribution in text-conditional, class-conditional, and unconditional image generation. These results point to the potential of our training-free safe denoiser for using DMs more safely.
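The core relationship the abstract mentions can be illustrated with the law of total expectation: splitting the data distribution into a negation region and its safe complement, the full denoiser output decomposes into a mixture of the two conditional means, so the safe conditional mean can be recovered by subtracting the negated component. The sketch below is an illustrative toy, not the paper's implementation; the function name `safe_denoiser`, the mixture weight `w`, and the stand-in arrays are assumptions for exposition.

```python
import numpy as np

def safe_denoiser(denoised_total, denoised_neg, w):
    """Illustrative sketch of the decomposition (not the paper's exact algorithm).

    By the law of total expectation over the data split into a negation
    region N and its safe complement,
        E[x0|xt] = w * E[x0|xt, N] + (1 - w) * E[x0|xt, safe],
    where w = P(x0 in N | xt). Solving for the safe conditional mean:
        E[x0|xt, safe] = (E[x0|xt] - w * E[x0|xt, N]) / (1 - w).
    """
    assert 0.0 <= w < 1.0, "w is the posterior mass of the negation set"
    return (denoised_total - w * denoised_neg) / (1.0 - w)

# Toy check: recombining the safe and negated means recovers the total mean.
total = np.array([0.2, 0.5])   # stand-in for a denoiser output E[x0|xt]
neg = np.array([0.9, 0.9])     # stand-in for the negation-region mean
w = 0.3
safe = safe_denoiser(total, neg, w)
print(np.allclose(w * neg + (1 - w) * safe, total))  # True
```

In a real sampler, `denoised_total` would come from the pretrained network and `denoised_neg` from an estimate over the negation set, with the adjusted mean plugged back into each denoising step.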

Mingyu Kim, Dongjun Kim, Amman Yusuf, Stefano Ermon, Mijung Park • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Image Generation | COCO 30k | FID | 22.55 | 29 |
| Safe Text-to-Image Generation | MMA-Diffusion | Automatic Safety Rate | 48.1 | 20 |
| Text-to-Image Generation | UnlearnDiff | ASR | 52.6 | 7 |
| Inappropriate Content Evaluation | CoPro | Harassment | 15.6 | 6 |
