
MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs

About

Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARSHAL, an end-to-end RL framework that incentivizes Multi-Agent Reasoning through Self-play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with up to 28.7% performance improvements in held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to 10.0% on AIME, 7.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
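The two estimation tricks named in the abstract, turn-level advantage estimation and agent-specific advantage normalization, can be illustrated with a minimal sketch. The function below is a hypothetical reconstruction, not the paper's implementation: it computes a discounted return for every turn of a multi-agent episode, then standardizes those returns separately within each agent's own turns so that one agent's reward scale does not dominate another's learning signal. All names (`agent_normalized_advantages`, `turn_rewards`, `agent_ids`) are assumptions for illustration.

```python
import numpy as np

def agent_normalized_advantages(turn_rewards, agent_ids, gamma=1.0):
    """Hypothetical sketch of turn-level credit assignment with
    agent-specific normalization in a multi-agent episode.

    turn_rewards: per-turn rewards for one episode, in turn order.
    agent_ids:    which agent acted at each turn.
    """
    n = len(turn_rewards)
    # Turn-level returns: discounted sum of this turn's and all future rewards,
    # so each turn's learning signal is aligned with its own interaction.
    returns = np.zeros(n, dtype=float)
    running = 0.0
    for t in reversed(range(n)):
        running = turn_rewards[t] + gamma * running
        returns[t] = running
    # Agent-specific normalization: standardize returns within each agent's
    # turns (zero mean, unit std), stabilizing training when agents see
    # rewards on different scales.
    adv = np.empty_like(returns)
    for agent in set(agent_ids):
        mask = np.array([a == agent for a in agent_ids])
        r = returns[mask]
        adv[mask] = (r - r.mean()) / (r.std() + 1e-8)
    return adv
```

For example, with alternating turns `agent_ids = [0, 1, 0, 1]` and rewards `[1, 0, 1, 0]`, each agent's two turns are normalized against each other rather than against the pooled episode statistics.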

Huining Yuan, Zelai Xu, Zheyue Tan, Xiangmin Yi, Mo Guang, Kaiwen Long, Haojia Hui, Boxun Li, Xinlei Chen, Bo Zhao, Xiao-Ping Zhang, Chao Yu, Yu Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Reasoning | Downstream Reasoning Benchmarks (MATH, GSM8K, AQUA, AIME, AMC, MMLU, GPQA) | Average Accuracy | 82.15 | 18 |
| Multi-Agent Reasoning | Reasoning Benchmarks, Competitive MAD framework (test) | Average Score | 0.8509 | 2 |
| Multi-Agent Reasoning | Reasoning Benchmarks, Cooperative AutoGen framework (test) | Overall Accuracy | 83.58 | 2 |
| Strategic game playing | Tic-Tac-Toe (train) | Win Rate | 54.05 | 2 |
| Strategic game playing | Kuhn Poker (train) | Win Rate | 44.49 | 2 |
| Strategic game playing | Mini Hanabi (train) | Win Rate | 55.28 | 2 |
| Strategic game playing | Connect Four, held-out (test) | Win Rate | 21.55 | 2 |
| Strategic game playing | Leduc Hold'em, held-out (test) | Win Rate | 53.89 | 2 |
| Strategic game playing | Simple Hanabi, held-out (test) | Win Rate | 37.27 | 2 |
