Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

About

Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
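The contrast between a global normalization baseline and the agent-wise remedy described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes the common GRPO-style advantage (reward minus mean, divided by standard deviation), and the function names, the agent labels `solver` and `verifier`, the toy rewards, and the `EPS` stabilizer are all hypothetical choices made for illustration.

```python
from collections import defaultdict
from statistics import fmean, pstdev

EPS = 1e-6  # hypothetical stabilizer to avoid division by zero


def global_advantages(rewards):
    """Normalize every reward with one mean/std pooled over all agents."""
    mu, sigma = fmean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + EPS) for r in rewards]


def agent_wise_advantages(rewards, agent_ids):
    """Normalize each reward with the statistics of its own agent only."""
    groups = defaultdict(list)
    for r, a in zip(rewards, agent_ids):
        groups[a].append(r)
    stats = {a: (fmean(rs), pstdev(rs)) for a, rs in groups.items()}
    return [
        (r - stats[a][0]) / (stats[a][1] + EPS)
        for r, a in zip(rewards, agent_ids)
    ]


# Toy data (assumed): two agents whose reward distributions differ sharply.
rewards = [0.9, 1.0, 0.8, 0.1, 0.0, 0.2]
agent_ids = ["solver"] * 3 + ["verifier"] * 3

global_adv = global_advantages(rewards)
agent_adv = agent_wise_advantages(rewards, agent_ids)

print("global:    ", [round(x, 2) for x in global_adv])
print("agent-wise:", [round(x, 2) for x in agent_adv])
```

In this toy setting the pooled baseline sits between the two agents' reward ranges, so every `solver` sample receives a positive advantage and every `verifier` sample a negative one, regardless of how good the sample is relative to that agent's own behavior. With per-agent statistics, each agent's advantages are centered on zero and scaled by its own spread, which is the calibration the abstract refers to.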

Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, Bo An • 2026

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	MATH500 (test)	Accuracy 90.7	922
Mathematical Reasoning	Minerva (test)	Acc 40.9	46
Mathematical Reasoning	AIME25 (test)	--	45
Mathematical Reasoning	Minerva	Avg@16 40.9	43
Mathematical Reasoning	AIME'24 (test)	Accuracy 44.6	43
Mathematical Reasoning	MATH 500	Avg@16 Score 92.4	23
Mathematical Reasoning	AIME 24	Avg@16 54.8	21
Mathematical Reasoning	OlympiadBench	Pass@16 73.6	16
Mathematical Reasoning	AIME 25	Avg@16 41.5	10
Mathematical Reasoning	AMC 23	Avg@16 89.5	10

Showing 10 of 16 rows
