
Trust Region Policy Optimisation in Multi-Agent Reinforcement Learning

About

Trust region methods rigorously enabled reinforcement learning (RL) agents to learn monotonically improving policies, leading to superior performance on a variety of tasks. Unfortunately, when it comes to multi-agent reinforcement learning (MARL), the property of monotonic improvement does not simply carry over; this is because agents, even in cooperative games, can have conflicting directions of policy updates. As a result, achieving a guaranteed improvement of the joint policy while each agent updates individually remains an open challenge. In this paper, we extend the theory of trust region learning to MARL. Central to our findings are the multi-agent advantage decomposition lemma and the sequential policy update scheme. Based on these, we develop the Heterogeneous-Agent Trust Region Policy Optimisation (HATRPO) and Heterogeneous-Agent Proximal Policy Optimisation (HAPPO) algorithms. Unlike many existing MARL algorithms, HATRPO/HAPPO neither require agents to share parameters nor rely on any restrictive assumptions about the decomposability of the joint value function. Most importantly, we justify in theory the monotonic improvement property of HATRPO/HAPPO. We evaluate the proposed methods on a series of Multi-Agent MuJoCo and StarCraft II tasks. Results show that HATRPO and HAPPO significantly outperform strong baselines such as IPPO, MAPPO and MADDPG on all tested tasks, thereby establishing a new state of the art.
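The sequential policy update scheme described above can be illustrated with a minimal numerical sketch (this is not the authors' implementation; the function names and the plain-NumPy setting are illustrative). Per the multi-agent advantage decomposition lemma, the joint advantage splits into a sum of per-agent advantages, which licenses updating agents one at a time: each agent maximises a PPO-style clipped surrogate in which the joint advantage estimate is re-weighted by the probability ratios of the agents already updated.

```python
import numpy as np

def clipped_surrogate(ratio, adv, eps=0.2):
    """PPO-style clipped objective (to be maximised) for a single agent."""
    return np.minimum(ratio * adv,
                      np.clip(ratio, 1.0 - eps, 1.0 + eps) * adv).mean()

def happo_sequential_objectives(old_probs, new_probs, advantages, eps=0.2):
    """Sketch of the HAPPO-style sequential update over agents.

    old_probs, new_probs: per-agent probabilities of the taken actions under
    the old and candidate policies, each of shape (num_agents, batch).
    advantages: joint advantage estimates, shape (batch,).
    Returns one surrogate objective per agent; the advantage is re-weighted
    by the ratios of previously updated agents before the next agent's turn.
    """
    m = np.asarray(advantages, dtype=float).copy()   # M^{i_1} = A
    objectives = []
    for old_p, new_p in zip(old_probs, new_probs):
        ratio = np.asarray(new_p, dtype=float) / np.asarray(old_p, dtype=float)
        objectives.append(clipped_surrogate(ratio, m, eps))
        m = ratio * m   # fold this agent's ratio into M for the next agent
    return objectives

# Toy example: two agents, a batch of two transitions.
objs = happo_sequential_objectives(
    old_probs=[[0.5, 0.5], [0.4, 0.6]],
    new_probs=[[0.6, 0.4], [0.5, 0.5]],
    advantages=[1.0, -1.0],
)
```

In a full implementation each agent would run gradient ascent on its surrogate before the next agent's turn; here the objectives are only evaluated, which is enough to show how the re-weighting factor chains the updates together.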

Jakub Grudzien Kuba, Ruiqing Chen, Muning Wen, Ying Wen, Fanglei Sun, Jun Wang, Yaodong Yang • 2021

Related benchmarks

Task                               | Dataset              | Result             | Rank
Multi-Agent Cooperative Control    | SMAC 3m v1 (train)   | Win Rate 100       | 12
Multi-Agent Reinforcement Learning | SMAC 1c3s5z (test)   | Test Win Rate 97.5 | 10
Multi-Agent Reinforcement Learning | SMAC-Hard 10m_vs_11m | Win Rate 57.6      | 7
Multi-Agent Reinforcement Learning | SMAC-Hard 3s5z       | Win Rate 0.681     | 7
Multi-Agent Reinforcement Learning | SMAC-Hard 2c_vs_64zg | Win Rate 0.733     | 7
Multi-Agent Reinforcement Learning | SMAC-Hard (3m)       | Win Rate 37.3      | 7
Multi-Agent Reinforcement Learning | SMAC-Hard 2s_vs_1sc  | Win Rate 0.00e+0   | 7
Multi-Agent Reinforcement Learning | SMAC-Hard 3s_vs_4z   | Win Rate 14.4      | 7
—                                  | SMAC 8m              | Win Rate 97.5      | 6
Multi-Agent Cooperative Control    | SMAC 8m v1 (train)   | Win Rate 97.5      | 6
Showing 10 of 48 rows
