THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

About

Multi-turn jailbreak attacks pose a growing threat to LLMs by exploiting conversational dynamics such as gradual escalation and cross-turn coordination. Existing defenses either rely on costly retraining, which often degrades model utility, or apply single-turn analysis independently at each turn, failing to capture how risk accumulates along interaction trajectories. We observe that safety behavior in multi-turn interaction is trajectory-dependent: dialogue history continuously reshapes the model's conditioning context, so evaluating each turn in isolation is insufficient. Motivated by this insight, we present THRD, the first training-free framework that explicitly models temporal risk accumulation for multi-turn jailbreak defense. THRD integrates four modules: a Turn-level Risk Assessor (TRA) for instantaneous risk estimation, a Historical Context Analyzer (HCA) for cross-turn intent escalation detection, a Response Evaluator (RE) for identifying facilitative outputs, and a Decision Module that combines these signals through a time-evolving scoring mechanism with attenuation-based modulation and trend-aware adjustment. Experiments against state-of-the-art multi-turn attacks, including tree-search-based and multi-agent collaborative methods, across two target models show that THRD reduces the attack success rate (ASR) to 0.2-4.0% while keeping utility degradation on MMLU and GSM8K within 1.5%. Ablation studies confirm non-redundant module contributions and stable cross-architecture generalization. Analysis of first rejection triggers reveals that over 70% of multi-turn attacks are not detected until Turn 2 or later, validating the necessity of explicit temporal aggregation.
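The abstract does not give the Decision Module's scoring formula, so the following is only an illustrative sketch of how per-turn signals might be aggregated over time. The class names, weights, decay factor, trend gain, and refusal threshold are all hypothetical and are not taken from the paper; the update rule (decayed running score plus a bonus for rising risk) is one plausible reading of "attenuation-based modulation and trend-aware adjustment".

```python
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple


@dataclass
class TurnSignals:
    """Per-turn scores in [0, 1]; names mirror the modules in the abstract."""
    tra: float  # Turn-level Risk Assessor: instantaneous risk
    hca: float  # Historical Context Analyzer: cross-turn escalation
    re: float   # Response Evaluator: how facilitative the reply is


@dataclass
class RiskTracker:
    """Hypothetical time-evolving score; every constant here is an assumption."""
    weights: Tuple[float, float, float] = (0.5, 0.3, 0.2)
    decay: float = 0.7        # attenuation applied to the accumulated score
    trend_gain: float = 0.5   # extra penalty when per-turn risk is rising
    threshold: float = 1.0    # refuse once the accumulated score reaches this
    score: float = 0.0
    prev_instant: float = 0.0

    def update(self, s: TurnSignals) -> bool:
        """Fold one turn into the running score; return True to refuse."""
        w_tra, w_hca, w_re = self.weights
        instant = w_tra * s.tra + w_hca * s.hca + w_re * s.re
        # Trend-aware adjustment: only increases in risk add to the score.
        trend = max(0.0, instant - self.prev_instant)
        self.score = self.decay * self.score + instant + self.trend_gain * trend
        self.prev_instant = instant
        return self.score >= self.threshold


def first_refusal_turn(turns: Sequence[TurnSignals]) -> Optional[int]:
    """Return the 1-indexed turn at which the tracker first refuses, if any."""
    tracker = RiskTracker()
    for i, signals in enumerate(turns, start=1):
        if tracker.update(signals):
            return i
    return None


# A gradually escalating dialogue: no single turn is alarming on its own.
escalating = [
    TurnSignals(tra=0.2, hca=0.1, re=0.1),
    TurnSignals(tra=0.4, hca=0.4, re=0.3),
    TurnSignals(tra=0.6, hca=0.7, re=0.5),
]
print(first_refusal_turn(escalating))
```

With these made-up numbers, no turn's weighted instantaneous risk reaches the threshold by itself, yet the accumulated score crosses it on the third turn. That is the behavior the abstract attributes to temporal aggregation, shown here under assumed parameters rather than the authors' actual mechanism.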

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang, Changliang Li, Pengyuan Liu • 2026

Related benchmarks

Task	Dataset	Metric	Result	
Language Understanding	MMLU	MMLU Accuracy	81.7	307
Mathematical Reasoning	GSM8K	Accuracy	92.6	192
Over-refusal evaluation	XSTest	Evaluation Score (avg@4)	18.8	70
Jailbreak Robustness	AutoDAN Harm single-turn attack	Attack Success Rate (ASR)	0	8
Jailbreak Robustness	AutoDAN Adv single-turn attack	ASR	0	8
Multi-turn Jailbreak	HarmBench	ASR (X-Teaming)	4	8
Multi-turn Jailbreak	AdvBench	ASR (X-Teaming)	1.3	8
Over-refusal evaluation	XS (test)	OR Score	22	7
Harmful Request Defense	JBB	Attack Success Rate (ASR)	0	7
Harmful Request Defense	authority probes	Block Rate	31	7

Showing 10 of 13 rows
