MUSE: MCTS-Driven Red Teaming Framework for Enhanced Multi-Turn Dialogue Safety in Large Language Models
About
As large language models (LLMs) become widely adopted, ensuring their alignment with human values is crucial to prevent jailbreaks, in which adversaries manipulate models into producing harmful content. While most defenses target single-turn attacks, real-world usage often involves multi-turn dialogues, exposing models to attacks that exploit conversational context to bypass safety measures. We introduce MUSE, a comprehensive framework tackling multi-turn jailbreaks from both attack and defense angles. For attacks, we propose MUSE-A, a method that uses frame semantics and heuristic tree search to explore diverse semantic trajectories. For defense, we present MUSE-D, a fine-grained safety alignment approach that intervenes early in dialogues to reduce vulnerabilities. Extensive experiments on various models show that MUSE effectively identifies and mitigates multi-turn vulnerabilities. Code is available at https://github.com/yansiyu02/MUSE.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Jailbreak Robustness | AdvBench | -- | 72 |
| Jailbreak Robustness | HarmBench | -- | 72 |
| Jailbreaking | HarmBench | -- | 68 |
| Jailbreak | WildJailBreak (WJB) (test) | ASR@1: 8.5 | 33 |
| Jailbreak | HarmBench (HB) (standard split) | ASR@1: 51.57 | 33 |
| Jailbreak | JailbreakBench (original split) | ASR@1: 41.82 | 33 |
| Multi-turn Jailbreaking | HarmBench (HB), JailbreakBench (JBB), and Wild Jailbreak (WJB) (test) | ASR@1 (Qwen2.5-7B-IT, HB): 53.45 | 11 |
| Jailbreaking | JailbreakBench (JBB) | ASR@1 (Qwen2.5-7B-IT): 41.81 | 11 |
| Jailbreaking | WildJailbreak (WJB) | ASR@1 (Qwen2.5-7B-IT): 42.5 | 11 |
| General Language Capability | General Capability Suite (MMLU, GSM8K, GPQA) | MMLU Accuracy: 73.6 | 5 |