
Proactive defense against LLM Jailbreak

About

The proliferation of powerful large language models (LLMs) has necessitated robust safety alignment, yet these models remain vulnerable to evolving adversarial attacks, including multi-turn jailbreaks that iteratively search for successful queries. Current defenses, which are primarily reactive and static, often fail to handle these iterative attacks. In this paper, we introduce ProAct, a novel proactive defense framework designed to disrupt and mislead these iterative search jailbreak methods. Our core idea is to return "spurious responses" that intentionally mislead these jailbreak methods into concluding that the model has been jailbroken. These misleading responses provide false signals to the attacker's internal optimization loop, causing the adversarial search to terminate prematurely and effectively jailbreaking the jailbreak. By conducting extensive experiments across state-of-the-art LLMs, jailbreaking frameworks, and safety benchmarks, we demonstrate that our method consistently and significantly reduces attack success rates by up to 94% without affecting utility. When combined with other defense frameworks, it further reduces the latest attack strategies' success rate to 0%. ProAct represents an orthogonal defense strategy that serves as an additional guardrail to enhance LLM safety against the most effective jailbreaking attacks.
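The termination effect described in the abstract can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the detector `is_flagged`, the attacker-side check `attacker_judge`, the loop `iterative_attack`, and both response strings are hypothetical stand-ins, and real attack frameworks and judges are far more elaborate.

```python
from typing import Callable, Tuple

REFUSAL = "I can't help with that."
# Hypothetical spurious response: looks like compliance to an automated
# judge but carries no harmful content.
SPURIOUS = "Sure, here is what you asked for: [benign placeholder text]"
BENIGN = "Normal helpful answer."


def is_flagged(prompt: str) -> bool:
    """Hypothetical stand-in for a harmful-intent detector."""
    return "forbidden" in prompt.lower()


def static_defense(prompt: str) -> str:
    """Reactive baseline: refuse every flagged prompt."""
    return REFUSAL if is_flagged(prompt) else BENIGN


def proactive_defense(prompt: str) -> str:
    """Proactive variant: answer flagged prompts with a spurious response."""
    return SPURIOUS if is_flagged(prompt) else BENIGN


def attacker_judge(response: str) -> bool:
    """Toy attacker-side success check: any non-refusal counts as success."""
    return not response.startswith("I can't")


def iterative_attack(
    target: Callable[[str], str], max_turns: int = 10
) -> Tuple[int, bool]:
    """Toy search loop: rephrase until the judge reports success.

    Returns (queries issued, whether the attacker believes it succeeded).
    """
    prompt = "forbidden request"
    for turn in range(1, max_turns + 1):
        if attacker_judge(target(prompt)):
            return turn, True  # search stops as soon as success is signalled
        prompt = f"{prompt} (rephrased, attempt {turn})"
    return max_turns, False


static_result = iterative_attack(static_defense)
proactive_result = iterative_attack(proactive_defense)
print("static defense:   ", static_result)
print("proactive defense:", proactive_result)
```

In this toy setting the refusals from the static defense keep the search running for its whole query budget, which in a real attack means more chances to find a prompt the detector misses. The spurious response ends the search on the first query because the attacker's own judge reports success, even though nothing harmful was produced.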

Weiliang Zhao, Jinjun Peng, Daniel Ben-Levi, Zhou Yu, Junfeng Yang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Language Understanding | MMLU | MMLU Accuracy | 78.6 | 147 |
| Mathematical Reasoning | GSM8K | Accuracy | 82.8 | 108 |
| Over-refusal evaluation | XSTest | Evaluation Score (avg@4) | 19.7 | 26 |
| Jailbreak Robustness | AutoDAN Harm single-turn attack | Attack Success Rate (ASR) | 0.00 | 8 |
| Jailbreak Robustness | AutoDAN Adv single-turn attack | ASR | 0.00 | 8 |
| Multi-turn Jailbreak | HarmBench | ASR (X-Teaming) | 63.5 | 8 |
| Multi-turn Jailbreak | AdvBench | ASR (X-Teaming) | 67 | 8 |