MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

About

Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARSHAL, an end-to-end RL framework that incentivizes Multi-Agent Reasoning through Self-play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with up to 28.7% performance improvements in held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to 10.0% on AIME, 7.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
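The two mechanisms named in the abstract, a turn-level advantage estimator and agent-specific advantage normalization, can be illustrated with a minimal sketch. The abstract gives no formulas, so everything below is an assumption made for illustration and not the authors' implementation: the reward-to-go form of the turn-level signal, the `gamma` discount, the mean/standard-deviation normalization, and the function names `turn_level_returns` and `agent_normalized_advantages` are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def turn_level_returns(rewards, gamma=1.0):
    """Reward-to-go for each turn of one agent's trajectory.

    Assumption: each turn's learning signal is the discounted sum of
    rewards from that turn onward, so every turn gets its own value
    instead of sharing one trajectory-level score.
    """
    returns, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    return returns[::-1]


def agent_normalized_advantages(turns, eps=1e-8):
    """Normalize turn-level returns separately for each agent.

    `turns` is a list of (agent_id, turn_return) pairs. Assumption:
    each agent's returns are centered and scaled with that agent's own
    mean and population standard deviation, so roles with different
    payoff scales do not distort one another's learning signal.
    """
    by_agent = defaultdict(list)
    for agent, ret in turns:
        by_agent[agent].append(ret)
    stats = {a: (mean(v), pstdev(v)) for a, v in by_agent.items()}
    return [(a, (ret - stats[a][0]) / (stats[a][1] + eps)) for a, ret in turns]


if __name__ == "__main__":
    # Hypothetical two-player game: only the final turn carries a reward.
    p0 = turn_level_returns([0.0, 0.0, 1.0], gamma=0.5)
    p1 = turn_level_returns([0.0, 0.0, -1.0], gamma=0.5)
    batch = [("p0", r) for r in p0] + [("p1", r) for r in p1]
    for agent, adv in agent_normalized_advantages(batch):
        print(agent, round(adv, 3))
```

Normalizing within each agent, rather than over the whole batch, keeps the first and second player of an asymmetric game from being compared against each other's reward scale, which is one plausible reading of how such normalization stabilizes multi-agent training.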

Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, Yu Wang • 2025

Related benchmarks

Task	Dataset	Result
Scientific Reasoning	GPQA D	Accuracy (%): 37.37	77
General Knowledge Reasoning	MMLU-Pro	Accuracy: 57.8	64
Reasoning	Downstream Reasoning Benchmarks (MATH, GSM8K, AQUA, AIME, AMC, MMLU, GPQA)	Average Accuracy: 82.15	18
Multi-Agent Strategic Reasoning	ConnectFour OOD	First-mover Normalized Score: 70.67	18
Adversarial Game Playing	Two Dollar	GPT-5.1 Score: 35.16	12
Adversarial Game Playing	Don’t Say It	GPT-5.1 Performance: 52.47	12
Adversarial Game Playing	Negotiation	GPT-5.1 Score: 14.06	12
Strategic Reasoning	RandomValue Negotiation OOD (held-out variant)	Win Rate: 12.55	12
Strategic Reasoning	VariableSum Dollar OOD (held-out variant)	Win Rate: 25.39	12
Strategic Reasoning	HardCore Don'tSayIt OOD (held-out variant)	Win Rate: 12.11	12

Showing 10 of 24 rows
