GuardT2I: Defending Text-to-Image Models from Adversarial Prompts

About

Recent advancements in Text-to-Image (T2I) models have raised significant safety concerns about their potential misuse for generating inappropriate or Not-Safe-For-Work (NSFW) content, despite existing countermeasures such as NSFW classifiers or model fine-tuning for inappropriate concept removal. To address this challenge, our study introduces GuardT2I, a novel moderation framework that adopts a generative approach to enhance T2I models' robustness against adversarial prompts. Instead of making a binary classification, GuardT2I uses a Large Language Model (LLM) to conditionally transform the text guidance embeddings within T2I models into natural language for effective adversarial prompt detection, without compromising the models' inherent performance. Our extensive experiments reveal that GuardT2I outperforms leading commercial solutions such as OpenAI-Moderation and Microsoft Azure Moderator by a significant margin across diverse adversarial scenarios. Our framework is available at https://github.com/cure-lab/GuardT2I.
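
The generative moderation idea described above can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`moderate`, `toy_interpret`), the blocklist, and the lookup table are all hypothetical, and the stand-in interpreter works on prompt strings rather than on real T2I text guidance embeddings. It only shows the control flow of "interpret first, then check the interpretation".

```python
from typing import Callable, Iterable

# Hypothetical blocklist; a real system would use a curated vocabulary.
SENSITIVE_WORDS = {"nude", "gore"}


def moderate(
    prompt: str,
    interpret: Callable[[str], str],
    sensitive_words: Iterable[str] = SENSITIVE_WORDS,
) -> bool:
    """Return True if the prompt should be rejected.

    `interpret` stands in for the model that turns the T2I text
    guidance into plain natural language (an assumption of this sketch).
    """
    interpretation = interpret(prompt).lower()
    tokens = set(interpretation.split())
    # Reject when the plain-language interpretation contains a sensitive word.
    return any(word in tokens for word in sensitive_words)


def toy_interpret(prompt: str) -> str:
    """Toy stand-in: maps one obfuscated prompt to its plain meaning."""
    lookup = {"xq zvb nvde": "a nude person"}
    # Unknown prompts are returned unchanged.
    return lookup.get(prompt, prompt)


if __name__ == "__main__":
    print(moderate("xq zvb nvde", toy_interpret))
    print(moderate("a cat on a sofa", toy_interpret))
```

The point of the design is that the check runs on the interpretation rather than on the raw prompt, so an obfuscated prompt that slips past a keyword filter can still be caught once its meaning is written out in plain language.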

Yijun Yang, Ruiyuan Gao, Xiao Yang, Jianyuan Zhong, Qiang Xu • 2024

Related benchmarks

Task	Dataset	Result
Safe Text-to-Image Generation	I2P	Inappropriate Probability 6	23
Safe Text-to-Image Generation	Unsafe Diffusion (UD)	IP Score 11	23
Safe Text-to-Image Generation	CoPro V2 (test)	IP 7	23
Safe Text-to-Image Generation	COCO 3K	FID 38.2	23
Harmful prompt detection	ViSU	Precision 48	11
Harmful prompt detection	adv-MMA	Precision 100	6
Harmful prompt detection	MMA	Precision 58	6
Harmful prompt detection	Sneakyprompt	Precision 52	6
Harmful prompt detection	COCO	Accuracy 77	6
Harmful prompt detection	I2P	Accuracy 26	6

Showing 10 of 12 rows
