X-Teaming: Multi-Turn Jailbreaks and Defenses with Adaptive Multi-Agents

About

Multi-turn interactions with language models (LMs) pose critical safety risks, as harmful intent can be strategically spread across exchanges. Yet, the vast majority of prior work has focused on single-turn safety, while adaptability and diversity remain among the key challenges of multi-turn red-teaming. To address these challenges, we present X-Teaming, a scalable framework that systematically explores how seemingly harmless interactions escalate into harmful outcomes and generates corresponding attack scenarios. X-Teaming employs collaborative agents for planning, attack optimization, and verification, achieving state-of-the-art multi-turn jailbreak effectiveness and diversity with success rates up to 98.1% across representative leading open-weight and closed-source models. In particular, X-Teaming achieves a 96.2% attack success rate against the latest Claude 3.7 Sonnet model, which has been considered nearly immune to single-turn attacks. Building on X-Teaming, we introduce XGuard-Train, an open-source multi-turn safety training dataset that is 20x larger than the previous best resource, comprising 30K interactive jailbreaks, designed to enable robust multi-turn safety alignment for LMs. Our work offers essential tools and insights for mitigating sophisticated conversational attacks, advancing the multi-turn safety of LMs.

Salman Rahman, Liwei Jiang, James Shiffer, Genglin Liu, Sheriff Issaka, Md Rizwan Parvez, Hamid Palangi, Kai-Wei Chang, Yejin Choi, Saadia Gabriel• 2025

Related benchmarks

Task	Dataset	Result
Jailbreak Attack	HarmBench	Attack Success Rate (ASR)85	557
Jailbreaking	AdvBench	--	132
Jailbreaking	HarmBench	--	68
Jailbreak Attack	159 harmful behaviors (test)	ASR96.23	63
Jailbreak Attack	JailbreakBench (JBB)	ASR52.73	62
Jailbreak Attack	Jailbreak Bench	Attack Success Rate100	42
Transferable Adversarial Attack	AdvBench LLM Classifier (test)	TASR@16.75e+3	39
Transferable Adversarial Attack	HarmBench Classifier (test)	TASR@168.6	37
Jailbreak	WildJailBreak (WJB) (test)	ASR@132.5	33
Jailbreak	HarmBench (HB) (standard split)	ASR@146.54	33

Showing 10 of 36 rows

Other info

Follow for update

@wizwand_team Discord