Believe What You See: Implicit Constraint Approach for Offline Multi-Agent Reinforcement Learning
About
Learning from datasets without interacting with the environment (offline learning) is an essential step toward applying Reinforcement Learning (RL) algorithms in real-world scenarios. Compared with its single-agent counterpart, however, offline multi-agent RL involves more agents and larger state and action spaces, which makes it more challenging, yet it has attracted little attention. We demonstrate that current offline RL algorithms are ineffective in multi-agent systems because extrapolation error accumulates. In this paper, we propose a novel offline RL algorithm, named Implicit Constraint Q-learning (ICQ), which effectively alleviates the extrapolation error by trusting only the state-action pairs given in the dataset for value estimation. Moreover, we extend ICQ to multi-agent tasks by decomposing the joint policy under the implicit constraint. Experimental results demonstrate that the extrapolation error is kept within a reasonable range and is insensitive to the number of agents. We further show that ICQ achieves state-of-the-art performance on challenging multi-agent offline tasks (StarCraft II). Our code is publicly available at https://github.com/YiqinYang/ICQ.
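As a rough illustration of what "trusting only the state-action pairs given in the dataset" can mean for value estimation, the tabular sketch below contrasts a bootstrapped target that queries Q only at next-step pairs found in the data with a standard max-over-actions target. It is not taken from the authors' repository: the function names, the softmax weighting with temperature `alpha`, and the rescaling by batch size are all assumptions made for illustration.

```python
import numpy as np

def in_sample_targets(q_table, rewards, next_states, next_actions, dones,
                      gamma=0.99, alpha=1.0):
    """Targets that bootstrap only from (s', a') pairs present in the batch.

    Hypothetical helper: weights each next-step Q-value by a softmax over
    the batch (temperature alpha), so no unseen action is ever evaluated.
    """
    q_next = q_table[next_states, next_actions]    # Q at dataset pairs only
    logits = q_next / alpha
    weights = np.exp(logits - logits.max())        # numerically stable softmax
    weights = weights / weights.sum()
    # Assumption: rescale by batch size so uniform weights give a SARSA-style target.
    weighted_q = len(q_next) * weights * q_next
    return rewards + gamma * (1.0 - dones) * weighted_q

def max_targets(q_table, rewards, next_states, dones, gamma=0.99):
    """Standard Q-learning targets: maximize over all actions, seen or unseen."""
    return rewards + gamma * (1.0 - dones) * q_table[next_states].max(axis=1)

# Toy Q-table: action 1 in state 0 never appears in the data, and its value
# (100.0) stands in for an overestimated, extrapolated entry.
q_table = np.array([[1.0, 100.0],
                    [2.0, 0.5]])
rewards = np.zeros(2)
dones = np.zeros(2)
next_states = np.array([0, 1])
next_actions = np.array([0, 0])                    # actions actually in the dataset

safe = in_sample_targets(q_table, rewards, next_states, next_actions, dones)
naive = max_targets(q_table, rewards, next_states, dones)
print("in-sample targets:", safe)
print("max-based targets:", naive)
```

In this toy setting the max-based target for the first transition picks up the extrapolated value, while the in-sample target never reads it. A smaller `alpha` concentrates the weights on the highest-valued dataset pairs, and a very large `alpha` approaches uniform weighting.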
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL Adroit (expert, human) | Adroit Door Return (Human) | 6.4 | 29 |
| Multi-Agent Reinforcement Learning | SMAC corridor (test) | Average Score | 16.74 | 12 |
| Multi-Agent Reinforcement Learning | SMAC 6h_vs_8z (test) | Average Score | 11.55 | 12 |
| Offline Reinforcement Learning | D4RL AntMaze fixed, play, diverse | AntMaze UMaze (Fixed) Score | 85 | 10 |
| StarCraft II micromanagement | StarCraft II 2s3z mixed | Win Rate | 85 | 8 |
| StarCraft II micromanagement | StarCraft II 2s3z medium_replay | Win Rate | 41 | 8 |
| StarCraft II micromanagement | StarCraft II 5m_vs_6m medium_replay | Win Rate | 18 | 8 |
| Multi-agent Offline Reinforcement Learning | MPE PP (Medium-replay) | Score | 34.5 | 8 |
| StarCraft II micromanagement | StarCraft II 2s3z medium | Win Rate | 18 | 8 |
| StarCraft II micromanagement | StarCraft II 5m_vs_6m medium | Win Rate | 26 | 8 |