FLIRT: Feedback Loop In-context Red Teaming

About

Warning: this paper contains content that may be inappropriate or offensive.

As generative models become available for public use in various applications, testing and analyzing their vulnerabilities has become a priority. In this work, we propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities to unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. In particular, taking text-to-image models as target models, we explore different feedback mechanisms to automatically learn effective and diverse adversarial prompts. Our experiments demonstrate that even with enhanced safety features, Stable Diffusion (SD) models are vulnerable to our adversarial prompts, raising concerns about their robustness in practical use. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models.
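The feedback loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the three callables (`propose_prompt`, `target_generate`, `unsafe_score`), the score threshold, and the first-in-first-out exemplar update are all hypothetical stand-ins. The abstract says only that "different feedback mechanisms" are explored, so the queue update shown here is one plausible choice among several.

```python
from collections import deque
from typing import Callable, Deque, List, Tuple


def feedback_loop_red_team(
    propose_prompt: Callable[[List[str]], str],  # hypothetical red-team language model
    target_generate: Callable[[str], object],    # hypothetical black-box target model
    unsafe_score: Callable[[object], float],     # hypothetical safety classifier, score in [0, 1]
    seed_prompts: List[str],
    iterations: int = 10,
    threshold: float = 0.5,
) -> List[Tuple[str, float]]:
    """Return the (prompt, score) pairs whose outputs were flagged by the scorer."""
    # In-context exemplars shown to the red-team model. A bounded deque drops
    # the oldest exemplar whenever a new one is appended (assumed FIFO update).
    exemplars: Deque[str] = deque(seed_prompts, maxlen=len(seed_prompts))
    findings: List[Tuple[str, float]] = []

    for _ in range(iterations):
        candidate = propose_prompt(list(exemplars))  # in-context generation
        output = target_generate(candidate)          # query the black-box target
        score = unsafe_score(output)                 # feedback signal

        if score >= threshold:
            findings.append((candidate, score))
            exemplars.append(candidate)              # successful prompt becomes an exemplar

    return findings


if __name__ == "__main__":
    # Toy stand-ins so the loop runs end to end; they carry no real model behavior.
    demo = feedback_loop_red_team(
        propose_prompt=lambda ex: ex[-1] + "x",
        target_generate=lambda prompt: prompt,
        unsafe_score=lambda out: 1.0 if len(out) % 2 == 0 else 0.0,
        seed_prompts=["a", "b"],
        iterations=3,
    )
    print(demo)
```

Passing the models in as callables keeps the loop agnostic to the target, which mirrors the black-box setting: the same skeleton applies whether the target produces images or text, as long as a scorer exists for its outputs.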

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta • 2023

Related benchmarks

Task	Dataset	Result
Red Teaming	I2P Nudity prompts	Failure Rate (FR): 6.22	48
Red Teaming	violence prompts	Failure Rate (FR): 6.02	48
Jailbreaking	Q16	ASR-4: 41	44
Jailbreaking	MHSC	ASR-4: 18.5	44
Jailbreaking	Unsafe Prompts	Bypass Success Rate (Text): 60	22
Adversarial Attack	GPT-4o	ASR: 12.1	14
Text-to-Video Red-teaming	390 Meta Harmful Seed Prompts	Violence Count: 45	12
Text-to-Image Adversarial Attack	I2P matching categories subset	Bypass Rate: 73.3	11
Red Teaming	Hunyuan-Video Seed-free generation	Violence Rate: 46	3
Red Teaming	Wan Seed-free generation 2.2	Violence Rate: 45	3

Showing 10 of 12 rows
