MMA-Diffusion: MultiModal Attack on Diffusion Models

About

In recent years, Text-to-Image (T2I) models have seen remarkable advancements, gaining widespread adoption. However, this progress has inadvertently opened avenues for potential misuse, particularly in generating inappropriate or Not-Safe-For-Work (NSFW) content. Our work introduces MMA-Diffusion, a framework that presents a significant and realistic threat to the security of T2I models by effectively circumventing current defensive measures in both open-source models and commercial online services. Unlike previous approaches, MMA-Diffusion leverages both textual and visual modalities to bypass safeguards like prompt filters and post-hoc safety checkers, thus exposing and highlighting the vulnerabilities in existing defense mechanisms.

Yijun Yang, Ruiyuan Gao, Xiaosen Wang, Tsung-Yi Ho, Nan Xu, Qiang Xu• 2023

Related benchmarks

Task	Dataset	Result
Red Teaming	I2P Nudity prompts	Failure Rate (FR)0.86	48
Red Teaming	violence prompts	Failure Rate (FR)6.02	48
Nudity Erasure	Nudity Erasure	ASR67	48
Jailbreaking	MHSC	ASR-426.5	44
Jailbreaking	Q16	ASR-429.5	44
Jailbreaking	Unsafe Prompts	Bypass Success Rate (Text)58	22
Textual Modal Attack	LAION-COCO subset, UnsafeDiff, and I2P NSFW prompts (test)	Q16 ASR (Step 4)84.9	15
Radiology Report Generation	IU-Xray	ROUGE-L Score32.95	12
Jailbreak Attack	VBCDE	ASR8	12
Jailbreak Attack	UnsafeDiff	Attack Success Rate (ASR)7.3	12

Showing 10 of 32 rows

Other info

Follow for update

@wizwand_team Discord