Efficient Online Reinforcement Learning with Offline Data
About
Sample efficiency and exploration remain major challenges in online reinforcement learning (RL). A powerful approach that can be applied to address these issues is the inclusion of offline data, such as prior trajectories from a human expert or a sub-optimal exploration policy. Previous methods have relied on extensive modifications and additional complexity to ensure the effective use of this data. Instead, we ask: can we simply apply existing off-policy methods to leverage offline data when learning online? In this work, we demonstrate that the answer is yes; however, a set of minimal but important changes to existing off-policy RL algorithms is required to achieve reliable performance. We extensively ablate these design choices, demonstrating the key factors that most affect performance, and arrive at a set of recommendations that practitioners can readily apply, whether their data comprise a small number of expert demonstrations or large volumes of sub-optimal trajectories. We see that correct application of these simple recommendations can provide a $\mathbf{2.5\times}$ improvement over existing approaches across a diverse set of competitive benchmarks, with no additional computational overhead. We have released our code at https://github.com/ikostrikov/rlpd.
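To make the idea of including offline data in an off-policy learner concrete, here is a minimal sketch of one simple way to do it: draw each minibatch partly from the fixed offline dataset and partly from the online replay buffer. This is an illustration under stated assumptions, not the released implementation. The function name `sample_mixed_batch`, the `offline_ratio` parameter, its default of 0.5, and the tuple-based transition format are all hypothetical choices made for this example.

```python
import random


def sample_mixed_batch(offline_data, online_buffer, batch_size,
                       offline_ratio=0.5, rng=None):
    """Draw a minibatch that mixes offline and online transitions.

    Hypothetical helper for illustration. `offline_data` must be a
    non-empty list of transitions; `online_buffer` may be empty at the
    start of training, in which case the whole batch is offline.
    The 0.5 default ratio is an assumption, not a value from the text.
    """
    rng = rng or random.Random()
    if not online_buffer:
        n_offline = batch_size
    else:
        n_offline = int(round(batch_size * offline_ratio))
    n_online = batch_size - n_offline

    # Sample with replacement from each source, then shuffle so the
    # learner does not see the two sources in a fixed order.
    batch = [rng.choice(offline_data) for _ in range(n_offline)]
    batch += [rng.choice(online_buffer) for _ in range(n_online)]
    rng.shuffle(batch)
    return batch


# Toy transitions tagged by source so the mix is easy to inspect.
offline = [("offline", i) for i in range(100)]
online = [("online", i) for i in range(10)]

batch = sample_mixed_batch(offline, online, batch_size=8,
                           rng=random.Random(0))
warmup_batch = sample_mixed_batch(offline, [], batch_size=8,
                                  rng=random.Random(0))
```

The resulting batch would then be passed to an ordinary off-policy update in place of a batch drawn only from the online buffer. Keeping the mixing in the sampler leaves the update rule itself unchanged, which matches the spirit of reusing existing off-policy methods.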
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Lift | Robomimic Lift-State | Success Rate | 99 | 30 |
| Square Nut Assembly | Robomimic Square-State | Success Rate | 0.00 | 30 |
| Can Pick & Place | Robomimic Can-State | Success Rate | 0.00 | 30 |
| Goal-conditioned manipulation | OGBench puzzle-4x4-play | Score | 58 | 24 |
| Robotic Manipulation | Can-Image | Success Rate | 0.00 | 21 |
| Locomotion | MuJoCo walker2d medium-replay D4RL | Average Normalized Score | 119 | 16 |
| Navigation | OGBench humanoidmaze-medium-navigate | Success Rate (Offline) | 0.00 | 15 |
| Quadruped Locomotion | Slippery Slope real-world evaluation | Forward Progression | 0.35 | 15 |
| Robotic Manipulation | OGBench puzzle-3x3-sparse online | Success Rate | 100 | 14 |
| Locomotion | MuJoCo hopper-random | Normalized Score | 90.2 | 14 |