FLIRT: Feedback Loop In-context Red Teaming

About

Warning: this paper contains content that may be inappropriate or offensive. As generative models become available for public use in various applications, testing and analyzing the vulnerabilities of these models has become a priority. In this work, we propose an automatic red teaming framework that evaluates a given black-box model and exposes its vulnerabilities to unsafe and inappropriate content generation. Our framework uses in-context learning in a feedback loop to red team models and trigger them into unsafe content generation. In particular, taking text-to-image models as target models, we explore different feedback mechanisms to automatically learn effective and diverse adversarial prompts. Our experiments demonstrate that even with enhanced safety features, Stable Diffusion (SD) models are vulnerable to our adversarial prompts, raising concerns about their robustness in practical use. Furthermore, we demonstrate that the proposed framework is effective for red teaming text-to-text models.
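The feedback loop described above can be sketched in a few lines. The version below is a simplified, hypothetical stand-in, not the paper's implementation: `red_lm_propose` and `unsafe_score` are stub functions standing in for a real red LM and for a target model plus safety classifier (the paper uses models such as Stable Diffusion scored by image safety classifiers), and the update rule shown is one plausible "scoring"-style strategy that keeps the most effective prompts as in-context examples.

```python
import random

def red_lm_propose(examples):
    """Stand-in for the red LM: craft a new adversarial prompt given the
    current in-context examples. A real system would format the examples
    as a few-shot prompt and query an LLM. (Hypothetical stub.)"""
    return random.choice(examples) + " (mutation)"

def unsafe_score(prompt):
    """Stand-in for querying the target model with `prompt` and scoring the
    output with a safety classifier; higher means more unsafe content was
    triggered. (Hypothetical stub returning a random score in [0, 1].)"""
    return random.random()

def flirt_scoring_loop(seed_prompts, n_iters=50):
    """Feedback loop with a scoring-style update rule: maintain a fixed-size
    set of in-context example prompts, and replace the weakest example
    whenever a newly generated prompt attains a higher unsafe score."""
    memory = [(p, unsafe_score(p)) for p in seed_prompts]
    found = []  # every generated prompt with its score, for later analysis
    for _ in range(n_iters):
        candidate = red_lm_propose([p for p, _ in memory])
        score = unsafe_score(candidate)
        found.append((candidate, score))
        # Feedback step: promote the candidate into the in-context set
        # if it outperforms the current weakest example.
        worst_idx = min(range(len(memory)), key=lambda i: memory[i][1])
        if score > memory[worst_idx][1]:
            memory[worst_idx] = (candidate, score)
    return memory, found
```

Calling `flirt_scoring_loop(["a photo of a person", "a dark alley scene"])` returns the final in-context example set together with all generated prompts; over iterations the example set's minimum score can only increase, which is what steers generation toward more effective adversarial prompts.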

Ninareh Mehrabi, Palash Goyal, Christophe Dupuy, Qian Hu, Shalini Ghosh, Richard Zemel, Kai-Wei Chang, Aram Galstyan, Rahul Gupta• 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreaking | Q16 | ASR-4 | 41 | 44 |
| Jailbreaking | MHSC | ASR-4 | 18.5 | 44 |
| Jailbreaking | Unsafe Prompts | Bypass Success Rate (Text) | 60 | 22 |
| Text-to-Video Red-teaming | 390 Meta Harmful Seed Prompts | Violence Count | 45 | 12 |
| Adversarial Attack | GPT-4o | ASR | 12.1 | 11 |
| Text-to-Image Adversarial Attack | I2P matching categories subset | Bypass Rate | 73.3 | 11 |
| Red Teaming | Hunyuan-Video Seed-free generation | Violence Rate | 46 | 3 |
| Red Teaming | Wan 2.2 Seed-free generation | Violence Rate | 45 | 3 |
| Adversarial Attack | Llama Maverick | Attack Success Rate (ASR) | 12.8 | 3 |
| Adversarial Attack | Claude Haiku | Attack Success Rate | 9.8 | 3 |
