
Dr. MAS: Stable Reinforcement Learning for Multi-Agent LLM Systems

About

Multi-agent LLM systems enable advanced reasoning and tool use via role specialization, yet reliable reinforcement learning (RL) post-training for such systems remains difficult. In this work, we theoretically pinpoint a key reason for training instability when extending group-based RL to multi-agent LLM systems. We show that under GRPO-style optimization, a global normalization baseline may deviate from diverse agents' reward distributions, which ultimately leads to gradient-norm instability. Based on this finding, we propose Dr. MAS, a simple and stable RL training recipe for multi-agent LLM systems. Dr. MAS uses an agent-wise remedy: normalizing advantages per agent using each agent's own reward statistics, which calibrates gradient scales and dramatically stabilizes training, both theoretically and empirically. Beyond the algorithm, Dr. MAS provides an end-to-end RL training framework for multi-agent LLM systems, supporting scalable orchestration, flexible per-agent LLM serving and optimization configs, and shared resource scheduling of LLM actor backends. We evaluate Dr. MAS on multi-agent math reasoning and multi-turn search benchmarks using Qwen2.5 and Qwen3 series models. Dr. MAS achieves clear gains over vanilla GRPO (e.g., +5.6% avg@16 and +4.6% pass@16 on math, and +15.2% avg@16 and +13.1% pass@16 on search) while largely eliminating gradient spikes. Moreover, it remains highly effective under heterogeneous agent-model assignments while improving efficiency.
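
The contrast between a global normalization baseline and the agent-wise remedy described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `global_normalize` and `agentwise_normalize`, the `eps` stabilizer, the agent labels, and the toy rewards are all assumptions made for illustration, and the sketch assumes a GRPO-style advantage of the form (reward minus mean) divided by standard deviation.

```python
import numpy as np


def global_normalize(rewards, eps=1e-8):
    """Hypothetical global baseline: one mean/std shared by all agents."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)


def agentwise_normalize(rewards, agent_ids, eps=1e-8):
    """Hypothetical agent-wise variant: each agent uses its own mean/std."""
    rewards = np.asarray(rewards, dtype=float)
    agent_ids = np.asarray(agent_ids)
    advantages = np.empty_like(rewards)
    for agent in np.unique(agent_ids):
        mask = agent_ids == agent          # samples produced by this agent
        r = rewards[mask]
        advantages[mask] = (r - r.mean()) / (r.std() + eps)
    return advantages


# Toy group (made-up numbers): two agents with very different reward levels.
rewards = np.array([0.9, 1.0, 0.8, 1.0, 0.0, 0.1, 0.0, 0.2])
agent_ids = np.array(["solver"] * 4 + ["verifier"] * 4)

adv_global = global_normalize(rewards)
adv_agent = agentwise_normalize(rewards, agent_ids)

for agent in np.unique(agent_ids):
    mask = agent_ids == agent
    print(agent,
          "global mean:", adv_global[mask].mean(),
          "agent-wise mean:", adv_agent[mask].mean())
```

In this toy setup the globally normalized advantages of each agent are shifted away from zero, because the shared mean sits between the two agents' reward levels, whereas the agent-wise version is centered and unit-scaled within each agent. How this shift relates to gradient-norm instability is the subject of the paper's analysis, which the sketch does not reproduce.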

Lang Feng, Longtao Zheng, Shuo He, Fuxiang Zhang, Bo An • 2026

Related benchmarks

Task                    Dataset          Result              Rank
Mathematical Reasoning  MATH500 (test)   Accuracy 90.7       895
Mathematical Reasoning  Minerva (test)   Acc 40.9            46
Mathematical Reasoning  Minerva          Avg@16 40.9         43
Mathematical Reasoning  AIME'24 (test)   Accuracy 44.6       39
Mathematical Reasoning  AIME25 (test)    --                  33
Mathematical Reasoning  MATH 500         Avg@16 Score 92.4   23
Mathematical Reasoning  OlympiadBench    Pass@16 73.6        16
Mathematical Reasoning  AIME 24          Avg@16 54.8         10
Mathematical Reasoning  AIME 25          Avg@16 41.5         10
Mathematical Reasoning  AMC 23           Avg@16 89.5         10
Showing 10 of 16 rows
