Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems
About
Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
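The contrast between a global baseline and the agent-wise remedy described above can be illustrated with a small sketch. This is not the Dr. MAS implementation: the abstract does not give the exact formula, so the code assumes the usual GRPO-style standardization (reward minus mean, divided by standard deviation). The function names, the `EPS` stabilizer, the agent labels `solver` and `verifier`, and the toy rewards are all hypothetical.

```python
from collections import defaultdict
from statistics import fmean, pstdev

EPS = 1e-6  # hypothetical stabilizer to avoid division by zero


def global_advantages(rewards):
    """GRPO-style baseline: one mean/std shared by every sample in the group."""
    mu = fmean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def agent_wise_advantages(rewards, agent_ids):
    """Normalize each reward with the mean/std of its own agent only."""
    by_agent = defaultdict(list)
    for r, a in zip(rewards, agent_ids):
        by_agent[a].append(r)
    stats = {a: (fmean(rs), pstdev(rs)) for a, rs in by_agent.items()}
    return [
        (r - stats[a][0]) / (stats[a][1] + EPS)
        for r, a in zip(rewards, agent_ids)
    ]


# Toy group (assumed data): one agent tends to earn high rewards, the other low.
rewards = [0.9, 1.0, 0.8, 0.1, 0.0, 0.2]
agent_ids = ["solver"] * 3 + ["verifier"] * 3

global_adv = global_advantages(rewards)
agent_adv = agent_wise_advantages(rewards, agent_ids)

print("global    :", [round(a, 2) for a in global_adv])
print("agent-wise:", [round(a, 2) for a in agent_adv])
```

In this toy group the global baseline gives every `solver` sample a positive advantage and every `verifier` sample a negative one, regardless of how each sample compares with that agent's own typical reward. With per-agent statistics, each agent's advantages are centered at zero and share the same scale.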
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MATH500 (test) | Accuracy | 90.7 | 895 |
| Mathematical Reasoning | Minerva (test) | Acc | 40.9 | 46 |
| Mathematical Reasoning | Minerva | Avg@16 | 40.9 | 43 |
| Mathematical Reasoning | AIME'24 (test) | Accuracy | 44.6 | 39 |
| Mathematical Reasoning | AIME25 (test) | -- | -- | 33 |
| Mathematical Reasoning | MATH 500 | Avg@16 Score | 92.4 | 23 |
| Mathematical Reasoning | OlympiadBench | Pass@16 | 73.6 | 16 |
| Mathematical Reasoning | AIME 24 | Avg@16 | 54.8 | 10 |
| Mathematical Reasoning | AIME 25 | Avg@16 | 41.5 | 10 |
| Mathematical Reasoning | AMC 23 | Avg@16 | 89.5 | 10 |
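
The Avg@16 and Pass@16 metrics in the table are not defined on this page. The sketch below assumes the common reading: with k sampled answers per problem, Avg@k is the mean fraction of correct samples, and Pass@k is the fraction of problems with at least one correct sample. The function names and the toy data (k = 4 rather than 16) are hypothetical.

```python
def avg_at_k(correct):
    """Mean over problems of the fraction of correct samples (assumed Avg@k)."""
    return sum(sum(row) / len(row) for row in correct) / len(correct)


def pass_at_k(correct):
    """Fraction of problems with at least one correct sample (assumed Pass@k)."""
    return sum(1 for row in correct if any(row)) / len(correct)


# Toy data: 3 problems, k = 4 samples each; True marks a correct sample.
samples = [
    [True, False, True, False],
    [False, False, False, False],
    [True, True, True, True],
]

print(f"avg@4  = {avg_at_k(samples):.3f}")
print(f"pass@4 = {pass_at_k(samples):.3f}")
```

Under this reading Pass@k can never be lower than Avg@k on the same samples, which matches the ordering of the two metrics wherever the page reports both.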