Query-Efficient Black-Box Red Teaming via Bayesian Optimization

About

The deployment of large-scale generative models is often restricted by their potential to harm users in unpredictable ways. We focus on the problem of black-box red teaming, where a red team generates test cases and interacts with the victim model to discover a diverse set of failures with limited query access. Existing red teaming methods construct test cases based on human supervision or a language model (LM) and query all test cases in a brute-force manner, without incorporating any information from past evaluations, which results in a prohibitively large number of queries. To address this, we propose Bayesian red teaming (BRT), a family of novel query-efficient black-box red teaming methods based on Bayesian optimization, which iteratively identify diverse positive test cases leading to model failures by utilizing a pre-defined user input pool and past evaluations. Experimental results on various user input pools demonstrate that our method consistently finds a significantly larger number of diverse positive test cases under a limited query budget than the baseline methods. The source code is available at https://github.com/snu-mllab/Bayesian-Red-Teaming.

Deokjae Lee, JunYeong Lee, Jung-Woo Ha, Jin-Hwa Kim, Sang-Woo Lee, Hwaran Lee, Hyun Oh Song • 2023
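
The abstract describes selecting test cases from a fixed input pool using past evaluations instead of querying everything. The loop below is a minimal sketch of that general idea (pool-based Bayesian optimization with a Gaussian-process surrogate and an upper-confidence-bound rule), not the BRT implementation from the linked repository. Everything in it is an assumption made for illustration: the 2-D random "embeddings", the toy `score_fn` standing in for a victim model plus failure classifier, the 0.5 threshold, and the names `rbf_kernel`, `gp_posterior`, and `pool_bo`. It also omits any diversity objective, which the abstract says BRT targets.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=1.0):
    # Squared-exponential kernel between the rows of a and the rows of b.
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-0.5 * sq / length_scale ** 2)

def gp_posterior(x_obs, y_obs, x_all, noise=1e-3):
    # Zero-mean GP regression: posterior mean and std at every row of x_all.
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(len(x_obs))
    k_ao = rbf_kernel(x_all, x_obs)
    mean = k_ao @ np.linalg.solve(k_oo, y_obs)
    var = 1.0 - np.einsum("ij,ji->i", k_ao, np.linalg.solve(k_oo, k_ao.T))
    return mean, np.sqrt(np.clip(var, 0.0, None))

def pool_bo(pool, score_fn, budget, n_init=5, beta=2.0, seed=0):
    # Query `budget` distinct pool items; after a random start, each new
    # item maximizes mean + beta * std under the current surrogate.
    rng = np.random.default_rng(seed)
    queried = [int(i) for i in rng.choice(len(pool), size=n_init, replace=False)]
    scores = [score_fn(pool[i]) for i in queried]
    while len(queried) < budget:
        mean, std = gp_posterior(pool[queried], np.array(scores), pool)
        ucb = mean + beta * std
        ucb[queried] = -np.inf  # never spend a query on an evaluated item
        nxt = int(np.argmax(ucb))
        queried.append(nxt)
        scores.append(score_fn(pool[nxt]))
    return queried, scores

# Hypothetical setup: 200 random 2-D points stand in for embedded user inputs,
# and the score peaks near (1, 1) to mimic a region of model failures.
pool = np.random.default_rng(1).normal(size=(200, 2))
score_fn = lambda x: float(np.exp(-np.sum((x - 1.0) ** 2)))
queried, scores = pool_bo(pool, score_fn, budget=30)
positives = [i for i, s in zip(queried, scores) if s > 0.5]
```

Here `budget` plays the role of the limited query budget: only 30 of the 200 pool items are ever scored, and `positives` collects the queried items whose toy score exceeds the assumed threshold.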

Related benchmarks

Task                       Dataset               Result        Rank
Open-domain dialogue       Bloom ZS              RSR 47.8      9
Open-domain dialogue       OPT-66B ZS            RSR 74.7      9
Open-domain dialogue       ConvAI2               RSR 38.6      9
Open-domain dialogue       Empathetic Dialogues  RSR 20.6      9
Open-domain dialogue       BAD                   RSR 53.7      9
Red Teaming against BB-3B  Bloom ZS              RSR 4.12e+3   9
Red Teaming against BB-3B  OPT-66B ZS            RSR 72.3      9
Red Teaming against BB-3B  ConvAI2               RSR 45        9
Red Teaming against BB-3B  Empathetic Dialogues  RSR 41.3      9
Red Teaming against BB-3B  BAD                   RSR 66.4      9
Showing 10 of 22 rows
