
Harnessing LLM to Attack LLM-Guarded Text-to-Image Models

About

To prevent Text-to-Image (T2I) models from generating unethical images, providers deploy safety filters that block inappropriate drawing prompts. Previous works employed token replacement to search for adversarial prompts that bypass these filters, but such attacks have become ineffective because nonsensical tokens fail semantic logic checks. In this paper, we approach adversarial prompts from a different perspective. We demonstrate that rephrasing a drawing intent into multiple benign descriptions of its individual visual components yields an effective adversarial prompt. We propose an LLM-piloted multi-agent method named DACA to perform this rephrasing automatically. Our method successfully bypasses the safety filters of DALL-E 3 and Midjourney to generate the intended images, achieving success rates of up to 76.7% and 64% in the one-time attack, and 98% and 84% in the re-use attack, respectively. We open-source our code and dataset at [this link](https://github.com/researchcode003/DACA).
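The divide-and-conquer rephrasing can be illustrated with a minimal sketch. This is not the authors' DACA implementation: the agent instructions, the `chat` helper, and the model choice are hypothetical, and it assumes the OpenAI Python client (`openai>=1.0`) with an API key in the environment.

```python
"""Minimal sketch of divide-and-conquer prompt rephrasing.

Illustrative only: agent prompts and model choice are assumptions,
not the DACA paper's exact pipeline.
"""
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment


def chat(instruction: str, text: str) -> str:
    """One LLM call acting as a single 'agent' in the pipeline."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # hypothetical choice; any chat model works
        messages=[
            {"role": "system", "content": instruction},
            {"role": "user", "content": text},
        ],
    )
    return resp.choices[0].message.content


def divide_and_conquer(intent: str) -> str:
    # 1. Divide: decompose the drawing intent into separate, neutral
    #    descriptions of its visual elements (subjects, actions, scene).
    elements = chat(
        "List the visual elements of this scene (subjects, actions, "
        "setting, style) as separate, neutral one-line descriptions.",
        intent,
    )
    # 2. Conquer: reassemble the element descriptions into one fluent
    #    drawing prompt that never states the original intent directly.
    return chat(
        "Combine these element descriptions into a single coherent "
        "image-generation prompt.",
        elements,
    )


if __name__ == "__main__":
    print(divide_and_conquer("a cat knocking a vase off a table"))
```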

Yimo Deng, Huangxun Chen • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Jailbreaking | Q16 | ASR-4 | 41 | 44 |
| Jailbreaking | MHSC | ASR-4 | 11 | 44 |
| Jailbreaking | Unsafe Prompts | Bypass Success Rate (Text) | 98.5 | 22 |
| Unsafe image generation | Borderline (test) | ASR (Q16) | 28.95 | 12 |
| Unsafe image generation | I2P (test) | ASR (Q16) | 23.09 | 12 |
| Unsafe image generation | Explicit (test) | ASR (Q16) | 28.18 | 12 |
| Text-to-Image Adversarial Attack | I2P matching categories subset | Bypass Rate | 93.3 | 11 |
| Jailbreak Attack | Advbench subset | ASR (Google Banana 2) | 32 | 8 |
| Jailbreak Attack | MaliciousInstruct | ASR (Google Banana 2) | 25 | 8 |
| Adversarial Attack | DALL·E 3 commercial (test) | BR | 0.65 | 7 |

Showing 10 of 11 rows.
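As a reading aid for the metrics above, the sketch below tallies a bypass rate and an attack success rate (ASR) from per-attempt logs. The record format and field names are hypothetical; a real evaluation would replace the boolean labels with classifier judgments (e.g., Q16-based scoring of the generated images).

```python
# Hypothetical record format: one entry per attack attempt.
attempts = [
    {"prompt_id": 1, "bypassed_filter": True,  "image_matches_intent": True},
    {"prompt_id": 1, "bypassed_filter": True,  "image_matches_intent": False},
    {"prompt_id": 2, "bypassed_filter": False, "image_matches_intent": False},
]

# Bypass rate: fraction of attempts that get past the safety filter.
bypass_rate = sum(a["bypassed_filter"] for a in attempts) / len(attempts)

# ASR: fraction of attempts that also yield the intended image.
asr = sum(a["image_matches_intent"] for a in attempts) / len(attempts)

print(f"bypass rate = {bypass_rate:.1%}, ASR = {asr:.1%}")
```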
