X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents
About
Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Jailbreak Attack | HarmBench | Attack Success Rate (ASR) 85 | 376 |
| Jailbreak Attack | JailbreakBench (JBB) | ASR 52.73 | 54 |
| Jailbreaking | AdvBench | -- | 44 |
| Transferable Adversarial Attack | AdvBench LLM Classifier (test) | TASR@16.75e+3 | 39 |
| Transferable Adversarial Attack | HarmBench Classifier (test) | TASR@168.6 | 37 |
| Multi-turn Jailbreaking | StrongReject (test) | ASR 0.72 | 30 |
| Illicit task completion | AgentHarm English prompts | AgentHarm Score (AHS) 27 | 20 |
| Jailbreaking | AdvBench | ASR@1 (No Refusal) 45.6 | 11 |
| Jailbreaking | GPT 5.1 | ASR 90.5 | 9 |
| Jailbreaking | GPT-4o | ASR 0.94 | 9 |