PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models
About
Recent text-to-image (T2I) models have exhibited remarkable performance in generating high-quality images from text descriptions. However, these models are vulnerable to misuse, particularly for generating not-safe-for-work (NSFW) content, such as sexually explicit, violent, political, and disturbing images, raising serious ethical concerns. In this work, we present PromptGuard, a novel content moderation technique that draws inspiration from the system prompt mechanism used for safety alignment in large language models (LLMs). Unlike LLMs, T2I models lack a direct interface for enforcing behavioral guidelines. Our key idea is to optimize a safety soft prompt that functions as an implicit system prompt within the T2I model's textual embedding space. This universal soft prompt (P*) directly moderates NSFW inputs, enabling safe yet realistic image generation without degrading inference efficiency or requiring proxy models. We further enhance its reliability and helpfulness through a divide-and-conquer strategy, which optimizes category-specific soft prompts and combines them into holistic safety guidance. Extensive experiments across five datasets demonstrate that PromptGuard effectively mitigates NSFW content generation while preserving high-quality benign outputs. PromptGuard is 3.8 times faster than prior content moderation methods and surpasses eight state-of-the-art defenses, reducing the optimal unsafe ratio to 5.84%.
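The core mechanism can be illustrated with a toy sketch: a learnable soft prompt is prepended to the user prompt's token embeddings (the frozen text encoder and diffusion model are untouched), and only the soft prompt is optimized to push the encoded representation away from unsafe content. The dimensions, the mean-pool "encoder", and the projection-based unsafe scorer below are illustrative assumptions for exposition, not the paper's actual loss or encoder.

```python
import numpy as np

rng = np.random.default_rng(0)
d, k = 16, 4                          # embedding dim, soft-prompt length (assumed)
unsafe_dir = rng.normal(size=d)       # toy stand-in for an "unsafe concept" direction
unsafe_dir /= np.linalg.norm(unsafe_dir)

def encode(prompt_emb, soft_prompt):
    """Toy 'text encoder': mean-pool the soft-prompt and prompt token embeddings."""
    return np.concatenate([soft_prompt, prompt_emb], axis=0).mean(axis=0)

def unsafe_score(pooled):
    """Toy NSFW score: projection of the pooled embedding onto the unsafe direction."""
    return float(pooled @ unsafe_dir)

# An 'NSFW-leaning' prompt: token embeddings biased toward the unsafe direction.
n = 8
prompt_emb = rng.normal(size=(n, d)) + 0.5 * unsafe_dir
soft_prompt = np.zeros((k, d))        # learnable safety soft prompt, P*

before = unsafe_score(encode(prompt_emb, soft_prompt))

# Gradient descent on the soft prompt only; for this mean-pool encoder,
# d(score)/d(soft_prompt_row) = unsafe_dir / (k + n).
lr = 2.0
for _ in range(100):
    grad = np.tile(unsafe_dir / (k + n), (k, 1))
    soft_prompt -= lr * grad

after = unsafe_score(encode(prompt_emb, soft_prompt))
print(f"unsafe score before={before:.3f} after={after:.3f}")
```

At inference time, the optimized soft prompt is simply concatenated in front of every user prompt's embeddings, which is why the method adds no proxy model and essentially no latency. The divide-and-conquer variant would repeat the optimization per NSFW category and combine the resulting soft prompts.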
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Safe Text-to-Image Generation | I2P | Inappropriate Probability | 12 | 23 |
| Safe Text-to-Image Generation | Unsafe Diffusion (UD) | IP Score | 11 | 23 |
| Safe Text-to-Image Generation | CoPro V2 (test) | IP | 7 | 23 |
| Safe Text-to-Image Generation | COCO 3K | FID | 46.39 | 23 |
| Safe Text-to-Image Generation | MMA-Diffusion | -- | -- | 20 |
| NSFW Content Moderation | Malicious NSFW datasets | Unsafe Ratio (Sexually Explicit) | 1.5 | 9 |
| Text-to-Image Safety Guarding | SneakyPrompt-N | Unsafe Ratio | 0.00e+0 | 9 |
| Text-to-Image Safety Guarding | SneakyPrompt-P | Unsafe Ratio | 1.51 | 9 |
| Image Generation | COCO prompts 2017 | Average Latency (s) | 1.39 | 9 |
| Benign Image Generation Preservation | COCO prompts 2017 | CLIP Score | 25.96 | 9 |