Jailbreaking Leading Safety-Aligned LLMs with Simple Adaptive Attacks

About

We show that even the most recent safety-aligned LLMs are not robust to simple adaptive jailbreaking attacks. First, we demonstrate how to leverage access to logprobs for jailbreaking: we design an adversarial prompt template (sometimes adapted to the target LLM) and then apply random search on a suffix to maximize a target logprob (e.g., of the token "Sure"), potentially with multiple restarts. In this way, we achieve a 100% attack success rate, with GPT-4 as the judge, on Vicuna-13B, Mistral-7B, Phi-3-Mini, Nemotron-4-340B, Llama-2-Chat-7B/13B/70B, Llama-3-Instruct-8B, Gemma-7B, GPT-3.5, GPT-4o, and R2D2 from HarmBench, which was adversarially trained against the GCG attack. We also show how to jailbreak all Claude models, which do not expose logprobs, via either a transfer or a prefilling attack with a 100% success rate. In addition, we show how to use random search on a restricted set of tokens to find trojan strings in poisoned models, a task that shares many similarities with jailbreaking; this algorithm earned us first place in the SaTML'24 Trojan Detection Competition. The common theme behind these attacks is that adaptivity is crucial: different models are vulnerable to different prompting templates (e.g., R2D2 is very sensitive to in-context learning prompts), some models have unique vulnerabilities based on their APIs (e.g., prefilling for Claude), and in some settings it is crucial to restrict the token search space based on prior knowledge (e.g., for trojan detection). For reproducibility, we provide the code, logs, and jailbreak artifacts in the JailbreakBench format at https://github.com/tml-epfl/llm-adaptive-attacks.
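The random-search loop with restarts mentioned in the abstract can be illustrated on a toy problem. The sketch below is not the authors' implementation: `random_search_suffix`, `toy_score`, and `TARGET` are hypothetical names, the objective is a stand-in string-matching score rather than a model logprob, and the single-position substitution rule is an assumption made for brevity.

```python
import random


def random_search_suffix(score_fn, vocab, suffix_len=8, n_iters=500,
                         n_restarts=3, seed=0):
    """Maximize score_fn(suffix) with random substitutions and restarts.

    score_fn is a black box returning a number; in the abstract's setting
    it would be a target logprob, here it is any caller-supplied callable.
    """
    rng = random.Random(seed)
    best_suffix, best_score = None, float("-inf")
    for _ in range(n_restarts):
        # Each restart begins from a fresh random suffix.
        suffix = [rng.choice(vocab) for _ in range(suffix_len)]
        score = score_fn(suffix)
        for _ in range(n_iters):
            # Propose a change at one random position (assumed rule).
            candidate = list(suffix)
            candidate[rng.randrange(suffix_len)] = rng.choice(vocab)
            cand_score = score_fn(candidate)
            # Keep the proposal only if it strictly improves the score.
            if cand_score > score:
                suffix, score = candidate, cand_score
        if score > best_score:
            best_suffix, best_score = suffix, score
    return best_suffix, best_score


# Toy stand-in objective: number of positions matching a fixed string.
TARGET = list("adaptive")


def toy_score(suffix):
    return sum(a == b for a, b in zip(suffix, TARGET))


vocab = list("abcdefghijklmnopqrstuvwxyz")
best, score = random_search_suffix(toy_score, vocab)
print("".join(best), score)
```

Because proposals are accepted only when the score strictly improves, the score within a restart never decreases, and keeping the best result across restarts hedges against an unlucky starting point.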

Maksym Andriushchenko, Francesco Croce, Nicolas Flammarion • 2024

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	HarmBench (test)	ASRHB 91.2	276
Jailbreak Attack	JailbreakBench	ASR 93	242
Jailbreak Attack	JailbreakBench	ASR@10 15	132
Token-forcing loss optimization	Random targets Held-out (val)	Qwen-2.5-7B Loss 12.03	56
Biosecurity Misuse Evaluation	BSD Biosecurity	Misuse Rate 1.3	49
Refusal Ablation and Jailbreak Attack Success	HarmBench	Attack Success Rate (ASR) 94.3	40
Jailbreaking Tool-Using LLM Agents	AgentDojo	ASR 47.17	36
Jailbreak Attack	AdvBench and VulMine 520 harmful behaviors and curated prompts	ASR 90	36
Adversarial Attack	NQ	ASR 100	24
Adversarial Attack	AdvBench	Attack Success Rate (ASR) 100	16

Showing 10 of 36 rows
