MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models

About

As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks where adversaries manipulate models to produce harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.

Siyu Yan, Long Zeng, Xuecheng Wu, Chengcheng Han, Kongcheng Zhang, Chong Peng, Xuezhi Cao, Xunliang Cai, Chenjuan Guo • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Jailbreak Robustness | AdvBench | -- | 72 |
| Jailbreak Robustness | HarmBench | -- | 72 |
| Jailbreaking | HarmBench | -- | 68 |
| Jailbreak | WildJailBreak (WJB) (test) | ASR@1: 8.5 | 33 |
| Jailbreak | HarmBench (HB) (standard split) | ASR@1: 51.57 | 33 |
| Jailbreak | JailbreakBench (original split) | ASR@1: 41.82 | 33 |
| Multi-turn Jailbreaking | HarmBench (HB), JailbreakBench (JBB), and Wild Jailbreak (WJB) (test) | ASR@1 (Qwen2.5-7B-IT, HB): 53.45 | 11 |
| Jailbreaking | JailbreakBench (JBB) | ASR@1 (Qwen2.5-7B-IT): 41.81 | 11 |
| Jailbreaking | WildJailbreak (WJB) | ASR@1 (Qwen2.5-7B-IT): 42.5 | 11 |
| General Language Capability | General Capability Suite (MMLU, GSM8K, GPQA) | MMLU Accuracy: 73.6 | 5 |
Showing 10 of 12 rows
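
The ASR@1 figures in the table are not defined on this page. As a rough sketch only, assuming ASR@k means the percentage of test prompts for which at least one of the first k attempts was judged successful (a common reading, not confirmed by the source), the metric could be computed as below. The function name `asr_at_k` and the input layout are hypothetical and are not taken from the MUSE code base.

```python
from typing import Sequence


def asr_at_k(outcomes: Sequence[Sequence[bool]], k: int = 1) -> float:
    """Return an attack success rate as a percentage.

    Assumption (not from the source): `outcomes[i][j]` is True when the
    j-th attempt on the i-th test prompt was judged successful, and a
    prompt counts as a success if any of its first k attempts succeeded.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not outcomes:
        raise ValueError("outcomes must not be empty")
    # Count prompts with at least one success among the first k attempts.
    hits = sum(1 for attempts in outcomes if any(attempts[:k]))
    return 100.0 * hits / len(outcomes)
```

Under this reading, a lower ASR on a defended model would indicate stronger robustness, while a higher ASR for an attack method would indicate a more effective attack; which of the two each table row reports is not stated here.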
