
Not All Turns Matter: Credit Assignment for Multi-Turn Jailbreaking

About

Deploying LLMs in multi-turn dialogues facilitates jailbreak attacks that distribute harmful intent across seemingly benign turns. Recent training-based multi-turn jailbreak methods learn long-horizon attack strategies from interaction feedback, but often rely on coarse trajectory-level outcome signals that broadcast uniformly to every turn. However, we find that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Such coarse outcome supervision induces a credit assignment problem, leading to over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones. To address this, we propose TRACE, a turn-aware credit assignment framework for reinforcement learning (RL)-based multi-turn jailbreaking. For successful trajectories, TRACE estimates turn-level contributions via leave-one-turn-out semantic masking; for failed ones, TRACE assigns penalties based on prompt harmfulness and semantic relevance, with an additional local refusal-aware penalty. Furthermore, we reuse the attack-side credit signal for multi-turn defense alignment. Extensive experiments on open-source and closed-source targets show that TRACE achieves strong overall performance in effectiveness, transferability, and efficiency, yielding about a 25% relative improvement in attack success rate over the strongest RL baseline while also improving the safety-utility balance when reused for defense alignment.

Zhida He, Xiaoyu Wen, Han Qi, Ziyuan Zhou, Peng Yu, Xingcheng Xu, Dongrui Liu, Xia Hu, Chaochao Lu, Qiaosheng Zhang • 2026
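
The abstract describes estimating turn-level contributions by leave-one-turn-out masking. The sketch below illustrates only that general attribution idea on toy data and is not the authors' implementation: the `MASK` placeholder, the function names, and the `toy_score` scorer are all hypothetical, and the real method's scoring signal and "semantic" masking are not specified here. Each turn's credit is taken to be the drop in a trajectory-level score when that turn alone is masked.

```python
from typing import Callable, List, Sequence

# Hypothetical placeholder standing in for a masked-out turn (an assumption,
# not something specified in the abstract).
MASK = "[MASKED]"


def leave_one_turn_out_credit(
    turns: Sequence[str],
    score: Callable[[Sequence[str]], float],
) -> List[float]:
    """Return one credit per turn: score(full) - score(with that turn masked)."""
    full_score = score(turns)
    credits = []
    for i in range(len(turns)):
        masked = list(turns)
        masked[i] = MASK  # mask exactly one turn, keep the others unchanged
        credits.append(full_score - score(masked))
    return credits


def toy_score(turns: Sequence[str]) -> float:
    """Toy scorer (assumption): counts distinct unmasked turns, scaled by 2."""
    return len({t for t in turns if t != MASK}) / 2


if __name__ == "__main__":
    # Turns 0 and 2 are duplicates, so masking either one loses nothing.
    trajectory = ["alpha", "beta", "alpha"]
    print(leave_one_turn_out_credit(trajectory, toy_score))
```

In this toy trajectory the two duplicate turns receive zero credit while the unique turn receives all of it, which is the kind of non-uniform signal that a single trajectory-level reward broadcast to every turn cannot express.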

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Jailbreaking | HarmBench | -- | 68 |
| Jailbreak | HarmBench (HB) (standard split) | ASR@1: 90.57 | 33 |
| Jailbreak | JailbreakBench (original split) | ASR@1: 95.15 | 33 |
| Jailbreak | WildJailBreak (WJB) (test) | ASR@1: 83.17 | 33 |
| Jailbreaking | JailbreakBench (JBB) | ASR@1 (Qwen2.5-7B-IT): 94.55 | 11 |
| Jailbreaking | WildJailbreak (WJB) | ASR@1 (Qwen2.5-7B-IT): 89.5 | 11 |
| Multi-turn Jailbreaking | HarmBench (HB), JailbreakBench (JBB), and Wild Jailbreak (WJB) (test) | ASR@1 (Qwen2.5-7B-IT, HB): 87.42 | 11 |
| Single-Turn Jailbreak Robustness | Single-Turn Attack Scenarios DAN, WildGuard | DAN ASR: 8.33 | 5 |
| General Language Capability | General Capability Suite (MMLU, GSM8K, GPQA) | MMLU Accuracy: 73.5 | 5 |
| Multi-Turn Jailbreak Robustness | Multi-Turn Attack Scenarios Actor Attack, MUSE-A, TRACE | Actor Attack ASR: 17.61 | 5 |