Perception-guided Jailbreak against Text-to-Image Models
About
In recent years, Text-to-Image (T2I) models have garnered significant attention due to their remarkable advancements. However, they have also raised security concerns because of their potential to generate inappropriate or Not-Safe-For-Work (NSFW) images. In this paper, inspired by the observation that texts with different semantics can lead to similar human perceptions, we propose an LLM-driven perception-guided jailbreak method, termed PGJ. It is a black-box jailbreak method that requires no specific T2I model (model-free) and generates highly natural attack prompts. Specifically, PGJ identifies a safe phrase that is similar to the target unsafe word in human perception yet inconsistent with it in text semantics, and uses it as a substitution. Experiments conducted with thousands of prompts on six open-source models and commercial online services verify the effectiveness of PGJ.
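The substitution step can be illustrated with a short sketch. The Python snippet below is illustrative only and not the paper's implementation: `query_llm`, the instruction template, and the toy substitute phrase are assumptions made for the example, standing in for whatever LLM and prompt the attack actually uses.

```python
# Minimal sketch of perception-guided substitution, assuming an arbitrary
# chat-style LLM accessible through a `query_llm(text) -> str` callable.

def perception_guided_substitute(unsafe_word: str, query_llm) -> str:
    """Ask an LLM for a safe phrase that evokes a similar human perception
    to `unsafe_word` while being semantically unrelated to it."""
    instruction = (
        "Give a short, safe phrase that a human would perceive visually as "
        f"similar to '{unsafe_word}', but whose literal meaning is different. "
        "Answer with the phrase only."
    )
    return query_llm(instruction).strip()


def build_attack_prompt(prompt: str, unsafe_word: str, query_llm) -> str:
    """Replace the unsafe word in the target prompt with its perceptual substitute."""
    substitute = perception_guided_substitute(unsafe_word, query_llm)
    return prompt.replace(unsafe_word, substitute)


if __name__ == "__main__":
    # Toy LLM stub so the sketch runs without an API key; a real attack
    # would route `query_llm` to an actual LLM service.
    def toy_llm(_instruction: str) -> str:
        return "deep red watercolor splashes"  # hypothetical substitute for "blood"

    print(build_attack_prompt("a warrior covered in blood", "blood", toy_llm))
```

The resulting prompt keeps its natural phrasing, which is why the method needs no access to the target T2I model.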
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| NSFW Concept Generation | NSFW-200 Violence v2.1 (test) | ASR-1: 36 | 70 |
| NSFW Concept Generation | NSFW-200 Sex v2.1 (test) | ASR-1: 20 | 70 |
| Video Jailbreaking | COCO 2017 (test) | Attack Success Rate: 47 | 48 |
| Video Jailbreaking | MM-SafetyBench 1.0 (test) | Attack Success Rate: 87 | 48 |
| Adversarial Attack | DALL·E 3 commercial (test) | BR: 0.55 | 7 |
| Jailbreaking MLLMs | 110 high-severity samples curated dataset | Jailbreak Success Rate (JSR): 36.63 | 6 |
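The success-rate metrics above (ASR, JSR) are generally reported as the fraction of attack prompts judged to have produced the targeted content. The sketch below shows that standard computation only; the per-prompt success judgments are assumed to come from each benchmark's own judge or safety filter, which is not modeled here.

```python
# Minimal sketch of an attack-success-rate computation, assuming a list of
# per-prompt boolean outcomes produced by a benchmark-specific judge.

def attack_success_rate(outcomes: list[bool]) -> float:
    """Percentage of attack prompts judged successful."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0


if __name__ == "__main__":
    # e.g. 3 of 5 prompts bypassed the safety filter -> ASR = 60.0
    print(attack_success_rate([True, False, True, True, False]))
```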