Benchmarking Batch Deep Reinforcement Learning Algorithms
About
Widely used deep reinforcement learning algorithms have been shown to fail in the batch setting, in which the agent learns from a fixed dataset without further interaction with the environment. Following this result, several papers have reported reasonable performance under a variety of environments and batch settings. In this paper, we benchmark recent off-policy and batch reinforcement learning algorithms under unified conditions on the Atari domain, with data generated by a single partially trained behavioral policy. We find that under these conditions, many of these algorithms underperform both DQN trained online with the same amount of data and the partially trained behavioral policy itself. To provide a strong baseline, we adapt the Batch-Constrained Q-learning (BCQ) algorithm to the discrete-action setting and show that it outperforms all existing algorithms on this task.
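The core idea of discrete BCQ is to restrict the Q-learning argmax to actions the behavioral policy plausibly took, as estimated by a behavior-cloning model. A minimal sketch of that action-selection rule is below; the function name, the plain NumPy arrays standing in for network outputs, and the specific numbers are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def bcq_select_action(q_values, behavior_probs, tau=0.3):
    """Hypothetical sketch of discrete BCQ action selection.

    q_values: per-action Q estimates for the current state.
    behavior_probs: per-action probabilities from a behavior-cloning model.
    tau: threshold; only actions whose probability is within a factor tau
         of the most likely action remain eligible for the argmax.
    """
    # Keep actions whose relative likelihood under the behavior model
    # clears the threshold; mask out the rest with -inf.
    mask = behavior_probs / behavior_probs.max() >= tau
    constrained_q = np.where(mask, q_values, -np.inf)
    return int(np.argmax(constrained_q))
```

With `tau = 0`, the rule reduces to ordinary Q-learning's unconstrained argmax; larger values of `tau` bias the policy toward actions well supported by the batch, which is what mitigates extrapolation error on out-of-distribution actions.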
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Sudoku Solving | Sudoku 2x2 | Final Reward | 1.3 | 14 |
| PointNav | MetaUrban 12K (Unseen) | Success Rate (SR) | 60 | 9 |
| PointNav | MetaUrban 12K (test) | Success Rate (SR) | 60 | 9 |
| SocialNav | MetaUrban 12K (test) | Success Rate (SR) | 17 | 9 |
| SocialNav | MetaUrban 12K (Unseen) | Success Rate (SR) | 8 | 9 |
| Constrained Reinforcement Learning | GRID | Episodic Reward | 276.3 | 8 |
| Human-robot task planning and allocation | HRTPA H1 R2 (test) | Makespan | 1.45e+3 | 8 |
| Human-robot task planning and allocation | HRTPA H1,R3 (test) | Makespan | 1.47e+3 | 8 |
| Human-robot task planning and allocation | HRTPA H2,R3 (test) | Makespan | 1.17e+3 | 8 |
| Human-robot task planning and allocation | HRTPA H3,R2 (test) | Makespan | 1.12e+3 | 8 |