Safe Text-to-Image Generation: Simply Sanitize the Prompt Embedding
About
In recent years, text-to-image (T2I) generation models have made significant progress in generating high-quality images that align with text descriptions. However, these models also risk unsafe generation, potentially producing harmful content that violates usage policies, such as explicit material. Existing safe generation methods typically suppress inappropriate content by erasing undesired concepts from visual representations, while neglecting to sanitize the textual representation. Although these methods mitigate the risk of misuse to some extent, their robustness remains insufficient against adversarial attacks. Given that semantic consistency between input text and output image is a core requirement of T2I models, we identify textual representations as the likely primary source of unsafe generation. To this end, we propose Embedding Sanitizer (ES), which enhances the safety of T2I models by sanitizing inappropriate concepts in prompt embeddings. To our knowledge, ES is the first interpretable safe generation framework that assigns each token in the prompt a score indicating its potential harmfulness. In addition, ES adopts a plug-and-play modular design, enabling seamless integration with various T2I models and other safeguards. Evaluations on five prompt benchmarks show that ES outperforms eleven existing safeguard baselines, achieving state-of-the-art robustness while maintaining high-quality image generation.
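The core idea — score each prompt token for harmfulness, then sanitize its embedding accordingly — can be sketched in a few lines. This is only an illustration under assumed names (`score_tokens`, `sanitize_embeddings`, a linear "harm direction", and a neutral anchor embedding are all hypothetical stand-ins); the actual ES module is a learned network, and this sketch does not reproduce its architecture or training.

```python
import numpy as np

def score_tokens(token_embeds, harm_direction):
    """Hypothetical per-token harmfulness scorer: sigmoid of each token
    embedding's projection onto a 'harm' direction. (The real ES learns
    these scores with a trained network; this is only a stand-in.)"""
    logits = token_embeds @ harm_direction
    return 1.0 / (1.0 + np.exp(-logits))

def sanitize_embeddings(token_embeds, harm_scores, safe_embed):
    """Interpolate each token embedding toward a neutral 'safe' anchor,
    weighted by its harmfulness score in [0, 1]. A score of 0 leaves the
    token untouched; a score of 1 replaces it with the anchor."""
    s = harm_scores[:, None]
    return (1.0 - s) * token_embeds + s * safe_embed

rng = np.random.default_rng(0)
embeds = rng.normal(size=(4, 8))           # 4 prompt tokens, 8-dim embeddings
direction = rng.normal(size=8)             # hypothetical harm direction
safe = np.zeros(8)                         # hypothetical neutral anchor

scores = score_tokens(embeds, direction)   # per-token harmfulness in (0, 1)
clean = sanitize_embeddings(embeds, scores, safe)
print(clean.shape)  # (4, 8)
```

Because the sanitized embeddings have the same shape as the originals, such a module can be dropped in between the text encoder and the diffusion model without modifying either — the plug-and-play property claimed above.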
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Concept Unlearning Preservation | NSFW | CSDR | 11.8 | 12 |
| Common Robustness | I2P | ASR | 6.57 | 12 |
| Common Robustness | MMA | ASR | 15.83 | 12 |
| Concept Unlearning (NSFW) | IGMU (standard evaluation) | FSR | 95.09 | 12 |