
PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

About

Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism in large language models (LLMs) for safety alignment. Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without affecting inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy that optimizes category-specific soft prompts and combines them into unified safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods while outperforming eight state-of-the-art defenses. Evaluations using both a multi-head safety classifier and a VLM-based guardrail further confirm its robustness, with average unsafe ratios of 5.84% and 6.18%, respectively. Our code and dataset are available at https://t2i-promptguard.github.io/.
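
To make the mechanism concrete, the following is a minimal sketch of how a soft prompt could act as an implicit system prompt in the textual embedding space. It is an illustration under stated assumptions, not the authors' implementation: the function names (combine_soft_prompts, apply_soft_prompt), the concatenation rule for combining category-specific prompts, the choice to append the soft prompt after the user's token embeddings, and the 77-token limit are all assumptions made for the example. The optimization of the soft prompt itself is not shown.

```python
from typing import Dict, List

Vector = List[float]


def combine_soft_prompts(category_prompts: Dict[str, List[Vector]]) -> List[Vector]:
    """Merge category-specific soft prompts into one sequence.

    Assumption: the combination is a plain concatenation, taken in
    sorted category order so the result is deterministic.
    """
    combined: List[Vector] = []
    for name in sorted(category_prompts):
        combined.extend(category_prompts[name])
    return combined


def apply_soft_prompt(
    token_embeddings: List[Vector],
    soft_prompt: List[Vector],
    max_len: int = 77,  # assumed context length of the text encoder
) -> List[Vector]:
    """Append the soft prompt to the user's token embeddings.

    User tokens are truncated first so the soft prompt always fits
    inside the encoder's context window.
    """
    if len(soft_prompt) > max_len:
        raise ValueError("soft prompt is longer than the context window")
    keep = max_len - len(soft_prompt)
    return token_embeddings[:keep] + soft_prompt


# Toy example: 5 user tokens, 4-dimensional embeddings, 2 categories.
prompt_emb = [[0.1] * 4 for _ in range(5)]
category_prompts = {
    "sexual": [[1.0] * 4],   # hypothetical learned vectors
    "violent": [[2.0] * 4],
}
p_star = combine_soft_prompts(category_prompts)
conditioned = apply_soft_prompt(prompt_emb, p_star, max_len=6)
```

Because the soft prompt only extends the embedding sequence that is already passed to the T2I model, a design like this adds no extra model call at inference time, which is consistent with the abstract's claim that no proxy model is required.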

Lingzhi Yuan, Xinfeng Li, Chejian Xu, Guanhong Tao, Xiaojun Jia, Yihao Huang, Wei Dong, Yang Liu, Bo Li • 2025

Related benchmarks

Task | Dataset | Result | Rank
Safe Text-to-Image Generation | I2P | Inappropriate Probability: 12 | 23
Safe Text-to-Image Generation | Unsafe Diffusion (UD) | IP Score: 11 | 23
Safe Text-to-Image Generation | CoPro V2 (test) | IP: 7 | 23
Safe Text-to-Image Generation | COCO 3K | FID: 46.39 | 23
Safe Text-to-Image Generation | MMA-Diffusion | -- | 20
Concept Erasure | NSFW Concepts | Sexually Concept Accuracy: 12 | 14
Concept Erasure | Painting Style Concepts | Erasure Success (Van Gogh): 28 | 12
Concept Erasure | Object Concepts | Car Accuracy: 47 | 12
Generative Quality Evaluation | Generative Quality Evaluation Prompts | CLIP Score: 26.71 | 11
Concept Erasure | Adversarial Prompts (Ring-A-Bell) | Success Rate (Sexually): 17.5 | 11
Showing 10 of 15 rows
