# Behavior Proximal Policy Optimization

## About
Offline reinforcement learning (RL) is a challenging setting in which existing off-policy actor-critic methods perform poorly, owing to the overestimation of out-of-distribution state-action pairs. Consequently, various augmentations have been proposed to keep the learned policy close to the offline dataset (i.e., the behavior policy). In this work, starting from an analysis of offline monotonic policy improvement, we arrive at the surprising finding that some online on-policy algorithms are naturally able to solve offline RL: the inherent conservatism of these on-policy algorithms is exactly what offline RL needs to overcome overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without introducing any extra constraint or regularization beyond PPO. Extensive experiments on the D4RL benchmark indicate that this remarkably simple method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.
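The conservatism described above comes from PPO's clipped importance ratio, here taken between the learned policy and the (estimated) behavior policy. The following is a minimal NumPy sketch of such a clipped surrogate objective; the function name and the fixed `clip_eps` value are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def clipped_surrogate(logp_new, logp_behavior, advantages, clip_eps=0.25):
    """PPO-style clipped surrogate objective (hypothetical sketch).

    The importance ratio is computed against the behavior policy, so large
    deviations from the dataset's policy are clipped away, which is the
    conservatism that BPPO relies on in the offline setting.
    """
    ratio = np.exp(logp_new - logp_behavior)           # pi_new / pi_behavior
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    # Pessimistic minimum of the clipped and unclipped terms, as in PPO.
    return np.mean(np.minimum(ratio * advantages, clipped * advantages))
```

When the learned policy equals the behavior policy, the ratio is 1 everywhere and the objective reduces to the mean advantage; as the policy drifts, the clip bounds how much any sample can contribute, discouraging out-of-distribution actions.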
## Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Hand Manipulation | Adroit door-human | Normalized Avg Score | 25.9 | 33 |
| Hand Manipulation | Adroit door-cloned | Normalized Score | -0.1 | 23 |
| Offline Reinforcement Learning | D4RL AntMaze v2 (various) | UMaze Success Rate | 53.3 | 20 |
| Pen | Adroit Pen Human v0 | Normalized Score | 117.8 | 19 |
| Relocate | Adroit Relocate Cloned v0 | Normalized Score | 0.2 | 19 |
| Pen | Adroit Pen v0 (Cloned) | Normalized Score | 3.04e+3 | 19 |
| Hammer | Adroit Hammer Human v0 | Normalized Score | 2.4 | 19 |
| Hammer | Adroit Hammer Cloned v0 | Normalized Score | 8.4 | 19 |
| Offline Reinforcement Learning | D4RL v2 (various) | Average Score | 49.5 | 17 |
| Offline Reinforcement Learning | D4RL Maze2D | Return (Dense, UMaze) | 51.4 | 15 |