
Steering Dialogue Dynamics for Robustness against Multi-turn Jailbreaking Attacks

About

Large language models (LLMs) have been shown to be vulnerable to jailbreaking attacks, in which adversarial prompts are designed to elicit harmful responses. While existing defenses effectively mitigate single-turn attacks by detecting and filtering unsafe inputs, they fail against multi-turn jailbreaks that exploit contextual drift over multiple interactions, gradually leading LLMs away from safe behavior. To address this challenge, we propose a safety steering framework grounded in safe control theory that ensures invariant safety in multi-turn dialogues. Our approach models the dialogue with LLMs using state-space representations and introduces a novel neural barrier function (NBF) to proactively detect and filter harmful queries emerging from evolving contexts. Our method achieves invariant safety at each turn of dialogue by learning a safety predictor that accounts for adversarial queries, preventing potential context drift toward jailbreaks. Extensive experiments across multiple LLMs show that our NBF-based safety steering outperforms safety alignment, prompt-based steering, and lightweight LLM guardrail baselines, offering stronger defenses against multi-turn jailbreaks while maintaining a better trade-off among safety, helpfulness, and over-refusal. Project website: https://sites.google.com/view/llm-nbf/home
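To make the barrier-function idea concrete, here is a minimal toy sketch of filtering queries by the safety of the predicted next dialogue state. It is not the paper's implementation: the hand-written linear barrier `barrier`, the moving-average dynamics `step`, the 2-D "embeddings", and every constant below are assumptions chosen for illustration, standing in for the learned neural barrier function and state-space model the abstract describes.

```python
from typing import List, Sequence, Tuple

Vector = List[float]

# Hypothetical barrier parameters: h(x) = W . x + B, with h(x) >= 0 treated as safe.
# Here the first coordinate plays the role of an assumed "risk" direction.
W: Sequence[float] = (-1.0, 0.0)
B: float = 0.5


def barrier(state: Vector) -> float:
    """Toy linear stand-in for a learned barrier function."""
    return sum(w * x for w, x in zip(W, state)) + B


def step(state: Vector, query: Vector, alpha: float = 0.5) -> Vector:
    """Toy dialogue dynamics: the state is a moving average of query embeddings."""
    return [(1 - alpha) * s + alpha * q for s, q in zip(state, query)]


def filter_query(state: Vector, query: Vector) -> Tuple[bool, Vector]:
    """Accept the query only if the predicted next state stays in the safe set."""
    predicted = step(state, query)
    if barrier(predicted) >= 0:
        return True, predicted
    # Rejected: the query is filtered and the dialogue state is left unchanged.
    return False, state


def run_dialogue(queries: List[Vector]) -> List[bool]:
    """Run the filter turn by turn and record which queries were accepted."""
    state: Vector = [0.0, 0.0]
    decisions = []
    for query in queries:
        accepted, state = filter_query(state, query)
        decisions.append(accepted)
    return decisions


if __name__ == "__main__":
    # The same query repeated: the state drifts toward the boundary each turn.
    print(run_dialogue([[0.6, 0.0]] * 3))  # [True, True, False]
```

In this toy setup the first two turns are accepted because the accumulated state is still inside the safe set, while the third identical query is filtered because it would push the state across the boundary. The decision depends on the dialogue history, not on the query alone.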

Hanjiang Hu, Alexander Robey, Changliu Liu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Over-refusal | XSTest | -- | 42 |
| Response Harmfulness Detection | HarmBench | F1 Score: 84.8 | 23 |
| Prompt Harmfulness Classification | WildGuard (test) | F1 (Total): 61.7 | 12 |
| Helpfulness evaluation | MTBench | Helpfulness: 9.35 | 8 |
| Safety Evaluation | ActorAttack | ASR: 3.5 | 8 |
| Safety Evaluation | Crescendo | ASR: 12 | 8 |
| Safety Evaluation | Opposite-day | ASR: 4.5 | 8 |
| Prompt Harmfulness Detection | AegisSafety (test) | F1 Score: 74.8 | 5 |
| Adversarial Attack | SafeMT ATTACK 600 | Attack Success Rate: 71.29 | 4 |
| Adversarial Attack | MHJ | Attack Success Rate: 76.72 | 4 |
Showing 10 of 14 rows
