Red-Teaming Text-to-Image Models via In-Context Experience Replay and Semantic-Preserving Prompt Rewriting

About

Understanding the capabilities of text-to-image (T2I) models in harmful content generation is essential to safety and compliance. However, human red-teaming is costly and inconsistent, driving the need for automatic tools that simulate realistic misuse attempts. Existing methods either require white-box access, fail to generalize across defenses, or produce uninterpretable adversarial tokens, while generating fluent prompts that preserve the original harmful intent remains underexplored despite its practical relevance. We propose ICER, a black-box framework that addresses this gap through two components: an LLM-based rewriter that produces fluent, natural-language adversarial prompts, and in-context experience replay that accumulates successful jailbreaking patterns into a reusable prior. These components are integrated via bandit optimization, enabling ICER to efficiently balance exploiting proven attack strategies with exploring new ones. Experiments across six safety mechanisms show that ICER outperforms seven baselines under both standard and semantics-preserving evaluation, with over 30% of generated prompts transferring to commercial systems like DALL-E 3 and Midjourney.
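The abstract says the two components are combined through bandit optimization that trades off exploiting proven strategies against exploring new ones, but it does not say which bandit algorithm is used. As a generic illustration of that trade-off only, the sketch below runs the standard UCB1 rule on simulated Bernoulli arms. The arms, the `success_probs` values, and the function names `ucb1_select` and `run_bandit` are hypothetical placeholders, not part of ICER.

```python
import math
import random


def ucb1_select(counts, values, t):
    """Return the index of the arm to try at step t (1-based).

    Generic UCB1, used here as an assumed stand-in; the abstract does
    not name the bandit algorithm the authors use.
    """
    # Try every arm once before relying on the confidence bound.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    # Mean reward (exploitation) plus a bonus that shrinks as an arm
    # is pulled more often (exploration).
    scores = [
        values[arm] + math.sqrt(2.0 * math.log(t) / counts[arm])
        for arm in range(len(counts))
    ]
    return max(range(len(counts)), key=scores.__getitem__)


def run_bandit(success_probs, steps, seed=0):
    """Simulate UCB1 on Bernoulli arms; return (pull counts, mean rewards).

    success_probs holds hypothetical per-arm success rates, used only
    to simulate feedback.
    """
    rng = random.Random(seed)
    counts = [0] * len(success_probs)
    values = [0.0] * len(success_probs)
    for t in range(1, steps + 1):
        arm = ucb1_select(counts, values, t)
        reward = 1.0 if rng.random() < success_probs[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running mean reward for this arm.
        values[arm] += (reward - values[arm]) / counts[arm]
    return counts, values


if __name__ == "__main__":
    counts, values = run_bandit([0.2, 0.5, 0.8], steps=1000)
    print("pulls per arm:", counts)
    print("estimated success rates:", [round(v, 2) for v in values])
```

Over many steps the pull counts concentrate on the arm with the highest observed success rate, while the logarithmic bonus keeps the rarely tried arms from being abandoned entirely.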

Zhi-Yi Chin, Pin-Yu Chen, Wei-Chen Chiu, Mario Fritz • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Red Teaming | I2P Nudity prompts | Failure Rate (FR) | 17.6 | 48 |
| Red Teaming | violence prompts | Failure Rate (FR) | 23.3 | 48 |
| Nudity Jailbreaking Transfer | DALL·E 3 Universal nudity jailbreaking prompts | Transfer Success Rate | 35.58 | 7 |
| Nudity Jailbreaking Transfer | Midjourney Universal nudity jailbreaking prompts | Transfer Success Rate | 41.35 | 7 |
| Nudity Jailbreaking Transfer | FLUX.1 (Universal nudity jailbreaking prompts) | Transfer Success Rate | 100 | 7 |
| Nudity Jailbreaking Transfer | SD3 Universal nudity jailbreaking prompts | Transfer Success Rate | 79.69 | 7 |
| Nudity Jailbreaking Transfer | SDXL Universal nudity jailbreaking prompts | Transfer Success Rate | 86.72 | 7 |