
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

About

Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly the generation of not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique inspired by the system prompt mechanism that large language models (LLMs) use for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without degrading inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy, which optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard runs 3.8× faster than prior content moderation methods and surpasses eight state-of-the-art defenses, reducing the unsafe ratio to as low as 5.84%.
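The core mechanism described above can be sketched in a few lines: a learned soft prompt is an embedding matrix prepended to the user prompt's token embeddings before they condition the T2I model, and the divide-and-conquer step concatenates category-specific soft prompts into one safety prefix. This is a minimal illustrative sketch, not the authors' implementation; the function names, the embedding width, and the use of concatenation to combine category prompts are assumptions for illustration.

```python
import numpy as np

EMBED_DIM = 768  # typical CLIP text-encoder width (assumption)

def apply_soft_prompt(token_embeds: np.ndarray, soft_prompt: np.ndarray) -> np.ndarray:
    """Prepend a safety soft prompt P* to the user prompt's token embeddings.

    token_embeds: (n_tokens, d) embeddings of the user prompt
    soft_prompt:  (k, d) optimized safety embeddings
    Returns a (k + n_tokens, d) conditioning sequence fed to the T2I model.
    """
    assert token_embeds.shape[1] == soft_prompt.shape[1], "embedding dims must match"
    return np.concatenate([soft_prompt, token_embeds], axis=0)

def combine_category_prompts(prompts: list[np.ndarray]) -> np.ndarray:
    """Divide-and-conquer: merge category-specific soft prompts
    (e.g. sexual, violent, political, disturbing) into one P*."""
    return np.concatenate(prompts, axis=0)

# Toy example with random embeddings standing in for real encoder outputs.
rng = np.random.default_rng(0)
user_embeds = rng.standard_normal((77, EMBED_DIM))                    # 77-token prompt
categories = [rng.standard_normal((4, EMBED_DIM)) for _ in range(4)]  # 4 categories, 4 tokens each
p_star = combine_category_prompts(categories)                         # (16, d)
conditioned = apply_soft_prompt(user_embeds, p_star)                  # (93, d)
```

Because the soft prompt lives in the embedding space and is fixed after optimization, applying it is a single concatenation at inference time, which is consistent with the paper's claim of no added inference cost.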

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Xiaofeng Wang, Bo Li • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safe Text-to-Image Generation | I2P | Inappropriate Probability | 12 | 23 |
| Safe Text-to-Image Generation | Unsafe Diffusion (UD) | IP Score | 11 | 23 |
| Safe Text-to-Image Generation | CoPro V2 (test) | IP | 7 | 23 |
| Safe Text-to-Image Generation | COCO 3K | FID | 46.39 | 23 |
| Safe Text-to-Image Generation | MMA-Diffusion | -- | -- | 20 |
| NSFW Content Moderation | Malicious NSFW datasets | Unsafe Ratio (Sexually Explicit) | 1.5 | 9 |
| Text-to-Image Safety Guarding | SneakyPrompt-N | Unsafe Ratio | 0.00e+0 | 9 |
| Text-to-Image Safety Guarding | SneakyPrompt-P | Unsafe Ratio | 1.51 | 9 |
| Image Generation | COCO prompts 2017 | Average Latency (s) | 1.39 | 9 |
| Benign Image Generation Preservation | COCO prompts 2017 | CLIP Score | 25.96 | 9 |
