MARSHAL: Incentivizing Multi-Agent Reasoning via Self-Play with Strategic LLMs
About
Developing Large Language Models (LLMs) to cooperate and compete effectively within multi-agent systems (MASs) is a critical step towards more advanced intelligence. While reinforcement learning (RL) has proven effective for enhancing reasoning in single-agent tasks, its extension to multi-turn, multi-agent scenarios remains underexplored due to the challenges of long-horizon credit assignment and agent-specific advantage estimation. To address these challenges, we introduce MARSHAL, an end-to-end RL framework that incentivizes Multi-Agent Reasoning through Self-play witH strAtegic LLMs in both cooperative and competitive games. MARSHAL features a turn-level advantage estimator that aligns learning signals with each interaction for credit assignment, and an agent-specific advantage normalization to stabilize multi-agent training. By learning with self-play across cooperative and competitive games, MARSHAL agents trained from Qwen3-4B develop strong strategic abilities, with up to 28.7% performance improvements in held-out games. More importantly, the capability acquired through self-play generalizes beyond games, yielding consistent performance gains of MASs in reasoning benchmarks. When integrated into leading MASs, our MARSHAL agent achieves significant zero-shot performance gains of up to 10.0% on AIME, 7.6% on GPQA-Diamond, and 3.5% on average across all benchmarks. These results establish self-play in strategic games as a powerful approach for developing generalizable multi-agent reasoning capabilities in LLMs.
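The abstract names two components, a turn-level advantage estimator and an agent-specific advantage normalization, without giving their formulas. The block below is only an illustrative sketch of how such pieces could fit together, not MARSHAL's actual implementation. It assumes that each turn's learning signal is a discounted return-to-go and that returns are standardized separately per agent. The function names `turn_level_returns` and `agent_normalized_advantages`, the `gamma` and `eps` parameters, and the agent labels are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev


def turn_level_returns(rewards, gamma=1.0):
    """Discounted return-to-go for every turn of one agent's trajectory.

    Assumption: one scalar reward per turn; the paper's estimator may differ.
    """
    returns = []
    running = 0.0
    # Walk backwards so each turn accumulates the rewards that follow it.
    for reward in reversed(rewards):
        running = reward + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def agent_normalized_advantages(turns, eps=1e-8):
    """Standardize returns separately for each agent.

    `turns` is a list of (agent_id, turn_return) pairs pooled over a batch.
    Each return is shifted and scaled by the statistics of its own agent,
    so agents with different reward scales yield comparable advantages.
    """
    by_agent = defaultdict(list)
    for agent_id, turn_return in turns:
        by_agent[agent_id].append(turn_return)

    # Per-agent mean and population standard deviation.
    stats = {a: (mean(v), pstdev(v)) for a, v in by_agent.items()}

    return [
        (turn_return - stats[agent_id][0]) / (stats[agent_id][1] + eps)
        for agent_id, turn_return in turns
    ]


if __name__ == "__main__":
    # Hypothetical two-player episode: rewards arrive only on the last turn.
    first_mover = turn_level_returns([0.0, 0.0, 1.0], gamma=0.9)
    second_mover = turn_level_returns([0.0, 0.0, -1.0], gamma=0.9)
    batch = [("first", r) for r in first_mover]
    batch += [("second", r) for r in second_mover]
    print(agent_normalized_advantages(batch))
```

Normalizing per agent rather than over the whole batch keeps one role's reward distribution (for example, a first mover with a built-in edge) from dominating the baseline used for the other role, which is one plausible reading of why the abstract ties this step to training stability.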
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Scientific Reasoning | GPQA D | Accuracy (%) | 37.37 | 77 |
| General Knowledge Reasoning | MMLU-Pro | Accuracy | 57.8 | 64 |
| Reasoning | Downstream Reasoning Benchmarks (MATH, GSM8K, AQUA, AIME, AMC, MMLU, GPQA) | Average Accuracy | 82.15 | 18 |
| Multi-Agent Strategic Reasoning | ConnectFour OOD | First-mover Normalized Score | 70.67 | 18 |
| Adversarial Game Playing | Two Dollar | GPT-5.1 Score | 35.16 | 12 |
| Adversarial Game Playing | Don’t Say It | GPT-5.1 Performance | 52.47 | 12 |
| Adversarial Game Playing | Negotiation | GPT-5.1 Score | 14.06 | 12 |
| Strategic Reasoning | RandomValue Negotiation OOD (held-out variant) | Win Rate | 12.55 | 12 |
| Strategic Reasoning | VariableSum Dollar OOD (held-out variant) | Win Rate | 25.39 | 12 |
| Strategic Reasoning | HardCore Don'tSayIt OOD (held-out variant) | Win Rate | 12.11 | 12 |