A Unified Game-Theoretic Approach to Multiagent Reinforcement Learning
About
To achieve general intelligence, agents must learn how to interact with others in a shared environment: this is the challenge of multiagent reinforcement learning (MARL). The simplest form is independent reinforcement learning (InRL), where each agent treats its experience as part of its (non-stationary) environment. In this paper, we first observe that policies learned using InRL can overfit to the other agents' policies during training, failing to sufficiently generalize during execution. We introduce a new metric, joint-policy correlation, to quantify this effect. We describe an algorithm for general MARL, based on approximate best responses to mixtures of policies generated using deep reinforcement learning, and empirical game-theoretic analysis to compute meta-strategies for policy selection. The algorithm generalizes previous ones such as InRL, iterated best response, double oracle, and fictitious play. Then, we present a scalable implementation which reduces the memory requirement using decoupled meta-solvers. Finally, we demonstrate the generality of the resulting policies in two partially observable settings: gridworld coordination games and poker.
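The loop the abstract describes (compute a meta-strategy over the current policy populations, then add a best response to the opponents' mixture) can be illustrated on a small matrix game. The following is a minimal sketch under stated assumptions, not the paper's implementation: it uses rock-paper-scissors as the game, exact best responses in place of best responses approximated with deep reinforcement learning, and fictitious play on the restricted game as a stand-in meta-solver. All function names (`fictitious_play`, `run_oracle_loop`, and the helpers) are hypothetical.

```python
# Row player's payoff in rock-paper-scissors (zero-sum): 0=rock, 1=paper, 2=scissors.
PAYOFF = [
    [0, -1, 1],
    [1, 0, -1],
    [-1, 1, 0],
]


def normalize(counts):
    """Turn non-negative counts into a probability vector."""
    total = sum(counts)
    return [c / total for c in counts]


def row_value(row_action, col_mix, col_pop):
    """Expected row payoff of a pure action against a mixture over col_pop."""
    return sum(p * PAYOFF[row_action][c] for p, c in zip(col_mix, col_pop))


def col_value(col_action, row_mix, row_pop):
    """Expected column payoff (negated row payoff) against a mixture over row_pop."""
    return sum(p * -PAYOFF[r][col_action] for p, r in zip(row_mix, row_pop))


def fictitious_play(row_pop, col_pop, steps=2000):
    """Assumed meta-solver: empirical play frequencies on the restricted game."""
    row_counts = [1] * len(row_pop)
    col_counts = [1] * len(col_pop)
    for _ in range(steps):
        row_mix, col_mix = normalize(row_counts), normalize(col_counts)
        best_r = max(range(len(row_pop)),
                     key=lambda i: row_value(row_pop[i], col_mix, col_pop))
        best_c = max(range(len(col_pop)),
                     key=lambda j: col_value(col_pop[j], row_mix, row_pop))
        row_counts[best_r] += 1
        col_counts[best_c] += 1
    return normalize(row_counts), normalize(col_counts)


def run_oracle_loop(iterations=5):
    """Grow each player's population with best responses to the meta-strategy."""
    row_pop, col_pop = [0], [0]  # both players start with "rock" only
    for _ in range(iterations):
        row_mix, col_mix = fictitious_play(row_pop, col_pop)
        # Exact best responses over the full action set (the "oracle" step).
        br_row = max(range(3), key=lambda a: row_value(a, col_mix, col_pop))
        br_col = max(range(3), key=lambda a: col_value(a, row_mix, row_pop))
        if br_row not in row_pop:
            row_pop.append(br_row)
        if br_col not in col_pop:
            col_pop.append(br_col)
    return row_pop, col_pop, fictitious_play(row_pop, col_pop)


if __name__ == "__main__":
    rows, cols, (row_mix, col_mix) = run_oracle_loop()
    print("row population:", rows)
    print("column population:", cols)
    print("row meta-strategy:", [round(p, 3) for p in row_mix])
```

In this toy setting the populations grow from rock alone to all three actions, since each new best response exploits the previous meta-strategy. Swapping the meta-solver changes which classical method the loop resembles: a meta-strategy that puts all weight on the newest policy gives iterated best response, while a uniform mixture over the population behaves like fictitious play.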
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Two-Player Zero-Sum Game Solving | Goofspiel 13 cards | Estimated PE | 66.7 | 112 |
| Competitive Game Strategy Optimization | RPS 1000D | Final KL Divergence | 0.0011 | 7 |
| Pursuit-Evasion | Grid Map | Success Rate | 100 | 7 |
| Pursuit-Evasion | Scotland-Yard Map | Success Rate | 100 | 7 |
| Pursuit-Evasion | Hollywood Walk of Fame | Success Rate | 95 | 7 |
| Pursuit-Evasion | The Bund | Success Rate | 95 | 7 |
| Coordination Game Strategy Optimization | Battle of Sexes | P1 Strategy Profile | 0.67 | 7 |
| Pursuit-Evasion | Downtown Map | Success Rate | 99 | 7 |
| Pursuit-Evasion | Sagrada Familia | Success Rate | 93 | 7 |
| Pursuit-Evasion | Big Ben | Success Rate | 99 | 7 |