MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs
About
Developing Large Language Models (LLMs) that cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARSHAL, an end-to-end RL framework that incentivizes Multi-Agent Reasoning through Self-play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and agent-specific advantage normalization to stabilize multi-agent training. By learning through self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with performance improvements of up to 28.7% on held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains for MASs on reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to 10.0% on AIME, 7.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
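The agent-specific advantage normalization described above can be illustrated with a minimal sketch. The function below is illustrative and not the paper's implementation: it assumes turn-level advantages arrive as `(agent_id, advantage)` pairs and normalizes each agent's advantages by that agent's own mean and standard deviation, so agents with different reward scales contribute comparable learning signals.

```python
# Hedged sketch of agent-specific advantage normalization.
# All names (normalize_per_agent, the (agent_id, advantage) format) are
# illustrative assumptions, not from the MARSHAL codebase.
from collections import defaultdict

def normalize_per_agent(turns, eps=1e-8):
    """turns: list of (agent_id, advantage) pairs, one per interaction turn.

    Returns the same pairs with each advantage standardized using the
    mean and std computed over that agent's turns only.
    """
    # Group advantages by agent.
    by_agent = defaultdict(list)
    for agent_id, adv in turns:
        by_agent[agent_id].append(adv)

    # Per-agent mean and (population) standard deviation.
    stats = {}
    for agent_id, advs in by_agent.items():
        mean = sum(advs) / len(advs)
        var = sum((a - mean) ** 2 for a in advs) / len(advs)
        stats[agent_id] = (mean, var ** 0.5)

    # Standardize each turn with its own agent's statistics.
    return [(aid, (adv - stats[aid][0]) / (stats[aid][1] + eps))
            for aid, adv in turns]
```

The key design choice is that statistics are never pooled across agents: in a competitive game, one agent's wins are another's losses, so a shared baseline would systematically bias both players' gradients.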
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Reasoning | Downstream reasoning benchmarks (MATH, GSM8K, AQUA, AIME, AMC, MMLU, GPQA) | Average Accuracy: 82.15 | 18 |
| Multi-Agent Reasoning | Reasoning benchmarks, competitive MAD framework (test) | Average Score: 0.8509 | 2 |
| Multi-Agent Reasoning | Reasoning benchmarks, cooperative AutoGen framework (test) | Overall Accuracy: 83.58 | 2 |
| Strategic game playing | Tic-Tac-Toe (train) | Win Rate: 54.05 | 2 |
| Strategic game playing | Kuhn Poker (train) | Win Rate: 44.49 | 2 |
| Strategic game playing | Mini Hanabi (train) | Win Rate: 55.28 | 2 |
| Strategic game playing | Connect Four, held-out (test) | Win Rate: 21.55 | 2 |
| Strategic game playing | Leduc Hold'em, held-out (test) | Win Rate: 53.89 | 2 |
| Strategic game playing | Simple Hanabi, held-out (test) | Win Rate: 37.27 | 2 |