A Minimalist Approach to Offline Reinforcement Learning
About
Offline reinforcement learning (RL) is the task of learning from a fixed batch of data. Because value estimates are error-prone for out-of-distribution actions, most offline RL algorithms constrain or regularize the policy toward the actions contained in the dataset. Since these algorithms are built on pre-existing RL algorithms, the modifications needed to make them work offline come at the cost of additional complexity. Offline RL algorithms introduce new hyperparameters and often rely on secondary components, such as generative models, while also adjusting the underlying RL algorithm. In this paper we aim to make a deep RL algorithm work offline with minimal changes. We find that we can match the performance of state-of-the-art offline RL algorithms simply by adding a behavior cloning term to the policy update of an online RL algorithm and normalizing the data. The resulting algorithm is a baseline that is simple to implement and tune, and it more than halves the overall run time by removing the additional computational overhead of previous methods.
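The two changes described above, normalizing the data and adding a behavior cloning term to the policy update, can be sketched as follows. This is a minimal, framework-free sketch under stated assumptions, not the authors' implementation. The critic values Q(s, π(s)) are taken as given. The behavior cloning term is assumed to be a mean squared error between the policy's actions and the dataset actions. The Q term is scaled by an assumed weight λ = α / mean|Q| with α = 2.5, which keeps the balance between the two terms insensitive to the scale of the Q-values. States are standardized per feature with the dataset mean and standard deviation, with an assumed small constant ε to avoid division by zero. The function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def normalize_states(
    states: Sequence[Sequence[float]], eps: float = 1e-3
) -> Tuple[Matrix, List[float], List[float]]:
    """Standardize each state feature with the dataset mean and (population) std.

    eps is an assumed small constant that guards against zero-variance features.
    Returns the normalized states plus the statistics, which should be reused
    to normalize states at evaluation time.
    """
    n = len(states)
    dim = len(states[0])
    mean = [sum(s[j] for s in states) / n for j in range(dim)]
    std = [
        math.sqrt(sum((s[j] - mean[j]) ** 2 for s in states) / n) for j in range(dim)
    ]
    normed = [[(s[j] - mean[j]) / (std[j] + eps) for j in range(dim)] for s in states]
    return normed, mean, std


def bc_regularized_actor_loss(
    q_values: Sequence[float],
    policy_actions: Sequence[Sequence[float]],
    dataset_actions: Sequence[Sequence[float]],
    alpha: float = 2.5,
) -> float:
    """Actor loss to minimize: -lambda * mean(Q(s, pi(s))) + MSE(pi(s), a).

    q_values holds Q(s, pi(s)) for each state in the mini-batch.
    lambda = alpha / mean(|Q|) is an assumed normalization; it requires the
    Q-values not to be all zero.
    """
    n = len(q_values)
    lam = alpha / (sum(abs(q) for q in q_values) / n)
    q_term = sum(q_values) / n
    # Behavior cloning term: mean squared error over every action dimension.
    squared_errors = [
        (p - a) ** 2
        for pi_a, data_a in zip(policy_actions, dataset_actions)
        for p, a in zip(pi_a, data_a)
    ]
    bc_term = sum(squared_errors) / len(squared_errors)
    return -lam * q_term + bc_term


# Usage with toy numbers (not data from the paper).
normed, mean, std = normalize_states([[0.0, 2.0], [2.0, 2.0]])
loss = bc_regularized_actor_loss([1.0, 3.0], [[0.0], [1.0]], [[0.0], [0.0]])
print(mean, std)  # [1.0, 2.0] [1.0, 0.0]
print(loss)       # -2.0
```

In a gradient-based implementation, λ would be treated as a constant (excluded from backpropagation), so only the Q term and the behavior cloning term drive the actor update. The same mean and standard deviation computed from the dataset would also be applied to states observed at evaluation time.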
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 95.9 | 117 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 112.4 | 115 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score | 110.1 | 86 |
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 140 | 77 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score | 60.9 | 72 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 26.1 | 70 |
| Offline Reinforcement Learning | D4RL Walker2d Medium v2 | Normalized Return | 83.7 | 67 |
| Offline Reinforcement Learning | D4RL hopper-random | Normalized Score | 11.1 | 62 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score | 48.3 | 59 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score | 44.6 | 59 |
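The values in the table are on the normalized scale used by the D4RL benchmark, where, for each task, 0 corresponds to a random policy's return and 100 to an expert policy's return, so a score above 100 exceeds the expert reference. A minimal sketch of that rescaling is below; the function name is hypothetical, and the reference returns in the usage line are placeholders, not D4RL's actual per-task constants.

```python
def normalized_score(raw_return: float, random_return: float, expert_return: float) -> float:
    """D4RL-style normalized score: 0 = random policy, 100 = expert policy."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
print(normalized_score(raw_return=550.0, random_return=100.0, expert_return=1000.0))  # 50.0
```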