
Behavior Proximal Policy Optimization

About

Offline reinforcement learning (RL) is a challenging setting where existing off-policy actor-critic methods perform poorly due to overestimation of out-of-distribution state-action pairs. Thus, various additional augmentations have been proposed to keep the learned policy close to the offline dataset (i.e., the behavior policy). In this work, starting from an analysis of offline monotonic policy improvement, we arrive at a surprising finding: some online on-policy algorithms are naturally able to solve offline RL. Specifically, the inherent conservatism of these on-policy algorithms is exactly what an offline RL method needs to overcome overestimation. Based on this, we propose Behavior Proximal Policy Optimization (BPPO), which solves offline RL without introducing any extra constraint or regularization beyond PPO. Extensive experiments on the D4RL benchmark indicate this extremely succinct method outperforms state-of-the-art offline RL algorithms. Our implementation is available at https://github.com/Dragon-Zhuang/BPPO.
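The conservatism the abstract refers to comes from PPO's clipped surrogate objective, which bounds how far the updated policy can drift from the policy that generated the data; in the offline setting that role is played by (an estimate of) the behavior policy. The following is a minimal sketch of such a clipped loss, assuming precomputed log-probabilities and advantages; the function name and default clip range are illustrative, not taken from the paper's implementation.

```python
import numpy as np

def clipped_surrogate_loss(logp_new, logp_old, advantages, clip_eps=0.25):
    """PPO-style clipped surrogate objective (returned as a loss to minimize).

    In a BPPO-like offline setup, `logp_old` would come from a policy cloned
    from the offline dataset, so the clip itself keeps the learned policy
    close to the behavior policy without any extra regularizer.
    """
    # Importance ratio pi_new(a|s) / pi_old(a|s), computed in log space.
    ratio = np.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    # Clipping removes the incentive to move the ratio outside [1-eps, 1+eps].
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Pessimistic minimum over the two terms, negated to form a loss.
    return -np.mean(np.minimum(unclipped, clipped))
```

For example, when the new and old log-probabilities coincide the ratio is 1 and the loss reduces to the negated mean advantage, while a ratio of 2 with a positive advantage is capped at `1 + clip_eps`.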

Zifeng Zhuang, Kun Lei, Jinxin Liu, Donglin Wang, Yilang Guo · 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Hand Manipulation | Adroit door-human | Normalized Avg Score | 25.9 | 33 |
| Offline Reinforcement Learning | D4RL Maze2D | Return (UMaze) | 58.9 | 31 |
| Hand Manipulation | Adroit door-cloned | Normalized Score | -0.1 | 23 |
| Relocate | Adroit Relocate Cloned v0 | Normalized Score | 0.2 | 21 |
| Offline Reinforcement Learning | D4RL Locomotion Full datasets | Hopper Score (m) | 93.9 | 21 |
| Offline Reinforcement Learning | D4RL AntMaze v2 (various) | UMaze Success Rate | 53.3 | 20 |
| Pen | Adroit Pen Human v0 | Normalized Score | 117.8 | 19 |
| Pen | Adroit Pen v0 (Cloned) | Normalized Score | 3.04e+3 | 19 |
| Hammer | Adroit Hammer Human v0 | Normalized Score | 2.4 | 19 |
| Hammer | Adroit Hammer Cloned v0 | Normalized Score | 8.4 | 19 |

Showing 10 of 26 rows
