Best-of-N Jailbreaking

About

We introduce Best-of-N (BoN) Jailbreaking, a simple black-box algorithm that jailbreaks frontier AI systems across modalities. BoN Jailbreaking works by repeatedly sampling variations of a prompt with a combination of augmentations - such as random shuffling or capitalization for textual prompts - until a harmful response is elicited. We find that BoN Jailbreaking achieves high attack success rates (ASRs) on closed-source language models, such as 89% on GPT-4o and 78% on Claude 3.5 Sonnet when sampling 10,000 augmented prompts. Further, it is similarly effective at circumventing state-of-the-art open-source defenses like circuit breakers. BoN also seamlessly extends to other modalities: it jailbreaks vision language models (VLMs) such as GPT-4o and audio language models (ALMs) like Gemini 1.5 Pro, using modality-specific augmentations. BoN reliably improves when we sample more augmented prompts. Across all modalities, ASR, as a function of the number of samples (N), empirically follows power-law-like behavior for many orders of magnitude. BoN Jailbreaking can also be composed with other black-box algorithms for even more effective attacks - combining BoN with an optimized prefix attack achieves up to a 35% increase in ASR. Overall, our work indicates that, despite their capability, language models are sensitive to seemingly innocuous changes to inputs, which attackers can exploit across modalities.
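The scaling claim in the abstract can be illustrated with a small curve fit. The sketch below assumes one plausible functional form, -log(ASR) ≈ a · N^(-b); the abstract only says the behavior is "power-law-like", so this form is an assumption. It recovers a and b by ordinary least squares in log-log space. The function names (`fit_power_law`, `predict_asr`) and the data points are hypothetical: the data are synthetic values generated from known parameters, not results from the paper.

```python
import math


def predict_asr(n, a, b):
    """ASR predicted at n samples under the assumed form -log(ASR) = a * n**(-b)."""
    return math.exp(-a * n ** (-b))


def fit_power_law(ns, asrs):
    """Fit -log(ASR) = a * N**(-b) by least squares in log-log space.

    Taking logs gives log(-log(ASR)) = log(a) - b * log(N), a straight line.
    Requires every ASR to lie strictly between 0 and 1.
    """
    xs = [math.log(n) for n in ns]
    ys = [math.log(-math.log(asr)) for asr in asrs]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return math.exp(intercept), -slope  # (a, b)


# Synthetic illustration only: generate points from known parameters,
# then check that the fit recovers them.
TRUE_A, TRUE_B = 3.0, 0.3
sample_ns = [1, 10, 100, 1000]
synthetic_asrs = [predict_asr(n, TRUE_A, TRUE_B) for n in sample_ns]

a_hat, b_hat = fit_power_law(sample_ns, synthetic_asrs)
forecast = predict_asr(10_000, a_hat, b_hat)  # extrapolate to a larger N
```

Under this assumed form, a fit on a small number of samples can be extrapolated to a larger N, which is the practical use of a scaling relationship like the one the abstract describes. Real measurements are noisy, so the fit would not be exact as it is here.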

John Hughes, Sara Price, Aengus Lynch, Rylan Schaeffer, Fazl Barez, Sanmi Koyejo, Henry Sleight, Erik Jones, Ethan Perez, Mrinank Sharma • 2024

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	HarmBench	--	624
Jailbreak Attack	STRONGREJECT (held-out behaviors)	ASR (0.5 threshold): 100	186
Jailbreak Attack Success	STRONGREJECT (train)	ASR (0.5): 100	62
Jailbreak Attack Success	STRONGREJECT 60 behaviors (train)	EVUS: 79	62
Jailbreak Robustness	STRONGREJECT (train)	EVUS: 79	62
Jailbreaking Attack Success	STRONGREJECT 40 held-out behaviors	EVUS: 81	62
Token-forcing loss optimization	Random targets Held-out (val)	Qwen-2.5-7B Loss: 15.39	56
Audio Jailbreak	Audio LLM Jailbreak Prompts	JSR (%): 68.27	40
Jailbreak	AdvBench Ensemble configuration GPT-4o	Attack Success Rate (ASR): 88.7	25
Jailbreak Attack	Claude 3.5	ASR: 1.67	24

Showing 10 of 20 rows
