The Surprising Effectiveness of PPO in Cooperative, Multi-Agent Games

About

Proximal Policy Optimization (PPO) is a ubiquitous on-policy reinforcement learning algorithm but is significantly less utilized than off-policy learning algorithms in multi-agent settings. This is often due to the belief that PPO is significantly less sample efficient than off-policy methods in multi-agent systems. In this work, we carefully study the performance of PPO in cooperative multi-agent settings. We show that PPO-based multi-agent algorithms achieve surprisingly strong performance in four popular multi-agent testbeds: the particle-world environments, the StarCraft multi-agent challenge, Google Research Football, and the Hanabi challenge, with minimal hyperparameter tuning and without any domain-specific algorithmic modifications or architectures. Importantly, compared to competitive off-policy methods, PPO often achieves competitive or superior results in both final returns and sample efficiency. Finally, through ablation studies, we analyze implementation and hyperparameter factors that are critical to PPO's empirical performance, and give concrete practical suggestions regarding these factors. Our results show that when using these practices, simple PPO-based methods can be a strong baseline in cooperative multi-agent reinforcement learning. Source code is released at https://github.com/marlbenchmark/on-policy.

Chao Yu, Akash Velu, Eugene Vinitsky, Jiaxuan Gao, Yu Wang, Alexandre Bayen, Yi Wu • 2021
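
For readers unfamiliar with PPO, the clipped surrogate loss at its core can be sketched as below. This is the standard single-agent formulation of the objective, not code taken from the authors' repository, and it omits the value loss, entropy bonus, and any multi-agent specifics. The function name `ppo_clipped_loss` and the default `clip_eps=0.2` are illustrative assumptions.

```python
import math


def ppo_clipped_loss(new_log_probs, old_log_probs, advantages, clip_eps=0.2):
    """Negated PPO clipped surrogate objective, averaged over samples.

    Hypothetical helper for illustration only; clip_eps=0.2 is an assumed
    default, not a value stated in the abstract above.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the policy
        # that collected the data.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - clip_eps, 1 + clip_eps].
        clipped = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped * adv)
    # Negate so that minimizing this loss maximizes the surrogate objective.
    return -total / len(advantages)
```

When the new and old policies agree, every ratio is 1 and the loss reduces to the negated mean advantage. Clipping only takes effect once the updated policy moves the ratio outside the allowed band in the direction that would increase the objective.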

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Mean Field Team Games competition | Battlefield 4x4 grid | Avg Reward: 77.21 | 25 |
| Multi-Agent Reinforcement Learning | SMAC v2 (test) | Win Rate (Protoss 5 Units): 38 | 20 |
| Multi-Agent Reinforcement Learning | SMAC maps | 5m_vs_6m Score: 21.9 | 18 |
| Robot Locomotion | Humanoid | Cumulative Reward: 5.30e+3 | 16 |
| Multi-Agent Reinforcement Learning | StarCraft Multi-Agent Challenge (SMAC) | 1c3s5z Win Rate: 100 | 13 |
| Multi-Agent Cooperative Control | SMAC 3m v1 (train) | Win Rate: 100 | 12 |
| Multi-Agent Reinforcement Learning | Multi-Agent MuJoCo HalfCheetah back foot | Average Score: 3.30e+3 | 12 |
| Multi-Agent Reinforcement Learning | Multi-Agent MuJoCo HalfCheetah fore shin | Average Evaluation Score: 3.31e+3 | 12 |
| Inventory Management | Supply Chain Demand Scenarios | Const-Uni: 30 | 12 |
| Mathematical Reasoning | GSM8K | Accuracy: 0.627 | 12 |
Showing 10 of 106 rows