
Jailbreaking Black Box Large Language Models in Twenty Queries

About

There is growing interest in ensuring that large language models (LLMs) align with human values. However, the alignment of such models is vulnerable to adversarial jailbreaks, which coax LLMs into overriding their safety guardrails. The identification of these vulnerabilities is therefore instrumental in understanding inherent weaknesses and preventing future misuse. To this end, we propose Prompt Automatic Iterative Refinement (PAIR), an algorithm that generates semantic jailbreaks with only black-box access to an LLM. PAIR -- which is inspired by social engineering attacks -- uses an attacker LLM to automatically generate jailbreaks for a separate targeted LLM without human intervention. In this way, the attacker LLM iteratively queries the target LLM to update and refine a candidate jailbreak. Empirically, PAIR often requires fewer than twenty queries to produce a jailbreak, which is orders of magnitude more efficient than existing algorithms. PAIR also achieves competitive jailbreaking success rates and transferability on open and closed-source LLMs, including GPT-3.5/4, Vicuna, and Gemini.
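The iterative loop described above can be sketched in a few lines. This is a minimal illustration only, not the authors' implementation: the `attacker`, `target`, and `judge` functions below are hypothetical stand-ins for real LLM API calls, and the 1-10 judge scale (with 10 meaning a successful jailbreak) follows the paper's described setup.

```python
# Minimal sketch of the PAIR refinement loop. All three model functions are
# hypothetical stubs standing in for black-box LLM API calls.

def attacker(goal, history):
    # Hypothetical attacker LLM: proposes a candidate jailbreak prompt,
    # conditioning on the goal and on prior (prompt, response, score) turns.
    return f"[candidate prompt for {goal!r}, turn {len(history) + 1}]"

def target(prompt):
    # Hypothetical target LLM: returns a (possibly refusing) response.
    return "I cannot help with that."

def judge(goal, prompt, response):
    # Hypothetical judge: rates 1-10 how fully the response achieves the
    # goal; this stub always refuses, so it always returns the minimum.
    return 1

def pair(goal, max_queries=20):
    """Iteratively refine a candidate jailbreak with black-box access only."""
    history = []
    for _ in range(max_queries):
        prompt = attacker(goal, history)
        response = target(prompt)
        score = judge(goal, prompt, response)
        if score == 10:                    # judged a successful jailbreak
            return prompt, len(history) + 1
        history.append((prompt, response, score))  # feed back for refinement
    return None, max_queries               # query budget exhausted

result, queries_used = pair("example-goal")
```

With the always-refusing stubs, the loop simply exhausts its budget; in the paper's setting, the attacker's conditioning on past failures is what drives the reported sub-twenty-query efficiency.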

Patrick Chao, Alexander Robey, Edgar Dobriban, Hamed Hassani, George J. Pappas, Eric Wong • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) | 74.5 | 376 |
| Jailbreak Attack | AdvBench | ASR | 98.3 | 247 |
| Jailbreak Attack | JailbreakBench | ASR@10 | 6 | 132 |
| Jailbreak | AdvBench | Avg Queries | 20.3 | 63 |
| Jailbreak Attack | JBB-Behaviors | Rule-Judge Score | 56 | 56 |
| Jailbreak Attack | JailbreakBench | ASR | 71 | 54 |
| Jailbreak Attack | JailbreakBench (JBB) | -- | -- | 54 |
| Jailbreaking | HarmBench 159 standard behaviors (test) | ASR | 7.5 | 51 |
| Jailbreak Attack | ShadowRisk | ASR-KW | 100 | 48 |
| Jailbreak | HarmBench Standard Behaviours (200 examples) | ASR | 7.5 | 48 |

Showing 10 of 67 rows.
