A Policy-Guided Imitation Approach for Offline Reinforcement Learning
About
Offline reinforcement learning (RL) methods can generally be categorized into two types: RL-based and Imitation-based. RL-based methods could in principle enjoy out-of-distribution generalization but suffer from erroneous off-policy evaluation. Imitation-based methods avoid off-policy evaluation but are too conservative to surpass the dataset. In this study, we propose an alternative approach, inheriting the training stability of imitation-style methods while still allowing logical out-of-distribution generalization. We decompose the conventional reward-maximizing policy in offline RL into a guide-policy and an execute-policy. During training, the guide-poicy and execute-policy are learned using only data from the dataset, in a supervised and decoupled manner. During evaluation, the guide-policy guides the execute-policy by telling where it should go so that the reward can be maximized, serving as the \textit{Prophet}. By doing so, our algorithm allows \textit{state-compositionality} from the dataset, rather than \textit{action-compositionality} conducted in prior imitation-style methods. We dumb this new approach Policy-guided Offline RL (\texttt{POR}). \texttt{POR} demonstrates the state-of-the-art performance on D4RL, a standard benchmark for offline RL. We also highlight the benefits of \texttt{POR} in terms of improving with supplementary suboptimal data and easily adapting to new tasks by only changing the guide-poicy.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| hopper locomotion | D4RL hopper medium-replay | Normalized Score98.9 | 56 | |
| walker2d locomotion | D4RL walker2d medium-replay | Normalized Score76.6 | 53 | |
| Locomotion | D4RL walker2d-medium-expert | Normalized Score109.1 | 47 | |
| Locomotion | D4RL Halfcheetah medium | Normalized Score48.8 | 44 | |
| Locomotion | D4RL Walker2d medium | Normalized Score0.811 | 44 | |
| hopper locomotion | D4RL Hopper medium | Normalized Score78.6 | 38 | |
| hopper locomotion | D4RL hopper-medium-expert | Normalized Score90 | 38 | |
| Locomotion | D4RL halfcheetah-medium-expert | Normalized Score94.7 | 37 | |
| Offline Reinforcement Learning | antmaze medium-play | Score84.6 | 35 | |
| HalfCheetah | D4RL Medium-Replay v0 | Normalized Score43.5 | 28 |