
Adversarial Reasoning at Jailbreaking Time

About

As large language models (LLMs) are becoming more capable and widespread, the study of their failure cases is becoming increasingly important. Recent advances in standardizing, measuring, and scaling test-time compute suggest new methodologies for optimizing models to achieve high performance on hard tasks. In this paper, we apply these advances to the task of model jailbreaking: eliciting harmful responses from aligned LLMs. We develop an adversarial reasoning approach to automatic jailbreaking that leverages a loss signal to guide the test-time compute, achieving SOTA attack success rates against many aligned LLMs, even those that aim to trade inference-time compute for adversarial robustness. Our approach introduces a new paradigm in understanding LLM vulnerabilities, laying the foundation for the development of more robust and trustworthy AI systems.
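To make the idea of "a loss signal guiding test-time compute" concrete, here is a minimal sketch of a greedy loss-guided prompt search. This is an illustration, not the paper's algorithm: `propose_variants` and `loss` are hypothetical stand-ins for what would, in the real setting, be an attacker LLM proposing prompt rewrites and a loss measured on the target model's response.

```python
import random

random.seed(0)

# Hypothetical stand-ins (assumptions, not the paper's implementation):
# in the real setting, propose_variants would query an attacker LLM and
# loss would score the target model's response to the candidate prompt.
def propose_variants(prompt, n):
    """Attacker step: propose n rewrites of the current prompt (stubbed)."""
    return [f"{prompt} [variant {random.randint(0, 999)}]" for _ in range(n)]

def loss(prompt):
    """Target-side loss: lower means closer to eliciting the target
    behavior. Stubbed here as a toy function of prompt length."""
    return abs(len(prompt) - 60) / 60

def adversarial_search(seed_prompt, iterations=5, branch=4):
    """Greedy loss-guided search: spend test-time compute exploring
    prompt variants, keeping the candidate with the lowest loss seen."""
    best, best_loss = seed_prompt, loss(seed_prompt)
    for _ in range(iterations):
        for cand in propose_variants(best, branch):
            cand_loss = loss(cand)
            if cand_loss < best_loss:
                best, best_loss = cand, cand_loss
    return best, best_loss

final_prompt, final_loss = adversarial_search("initial request")
print(final_loss)  # never worse than the seed prompt's loss
```

Because the search only ever replaces the incumbent when the loss strictly decreases, the returned loss is monotonically non-increasing in the compute spent, which is the sense in which extra test-time compute is traded for attack strength.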

Mahdi Sabbaghi, Paul Kassianik, George Pappas, Yaron Singer, Amin Karbasi, Hamed Hassani • 2025

Related benchmarks

Task              Dataset                                       Metric  Result  Rank
Jailbreak Attack  AdvBench (Claude-3.5-Sonnet)                  ASR     36%     7
Jailbreak Attack  AdvBench (Llama-3-8B)                         ASR     88%     7
Jailbreak Attack  AdvBench (GPT-4o)                             ASR     94%     7
Jailbreak Attack  AdvBench (o1-preview)                         ASR     56%     6
Red Teaming       HarmBench (Llama-2-7B, test)                  ASR     60%     5
Red Teaming       HarmBench (Llama-3-8B, test)                  ASR     88%     5
Red Teaming       HarmBench (Claude-Sonnet-3.5, held-out test)  ASR     36%     5
Red Teaming       HarmBench (gpt-4o-2024-08-06, test)           ASR     86%     3
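All results above use attack success rate (ASR), reported here as a percentage: the fraction of harmful behaviors in the benchmark for which the attack elicited a successful (judged-harmful) response. A minimal computation, with made-up counts for illustration:

```python
def attack_success_rate(outcomes):
    """ASR as a percentage: share of attack attempts judged successful."""
    return 100.0 * sum(outcomes) / len(outcomes)

# Toy example (counts are illustrative, not from the paper):
# 47 successful jailbreaks out of 50 attempted behaviors.
print(attack_success_rate([True] * 47 + [False] * 3))  # → 94.0
```

Note that some leaderboards report ASR as a fraction in [0, 1] rather than a percentage, so values must be normalized before comparing across tables.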
