Offline Reinforcement Learning via High-Fidelity Generative Behavior Modeling

About

In offline reinforcement learning, weighted regression is a common method to ensure the learned policy stays close to the behavior policy and to prevent selecting out-of-sample actions. In this work, we show that due to the limited distributional expressivity of policy models, previous methods might still select unseen actions during training, which deviates from their initial motivation. To address this problem, we adopt a generative approach by decoupling the learned policy into two parts: an expressive generative behavior model and an action evaluation model. The key insight is that such decoupling avoids learning an explicitly parameterized policy model with a closed-form expression. Directly learning the behavior policy allows us to leverage existing advances in generative modeling, such as diffusion-based methods, to model diverse behaviors. As for action evaluation, we combine our method with an in-sample planning technique to further avoid selecting out-of-sample actions and increase computational efficiency. Experimental results on D4RL datasets show that our proposed method achieves competitive or superior performance compared with state-of-the-art offline RL methods, especially in complex tasks such as AntMaze. We also empirically demonstrate that our method can successfully learn from a heterogeneous dataset containing multiple distinctive but similarly successful strategies, whereas previous unimodal policies fail.
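The decoupling described above can be illustrated with a minimal sketch: candidate actions are drawn from a behavior model, scored by an action evaluation model, and one candidate is resampled in proportion to a softmax of its score, so the chosen action always comes from the behavior model's own samples. Everything in this sketch is an assumption made for illustration and is not taken from the paper: the function names, the two-mode toy sampler standing in for a diffusion model, the toy value function, the softmax resampling rule, and the `num_candidates` and `temperature` parameters.

```python
import math
import random


def sample_behavior_actions(state, num_candidates, rng):
    """Hypothetical stand-in for a generative behavior model.

    A real system would call a learned sampler (e.g. a diffusion model)
    conditioned on `state`. This toy version ignores the state and draws
    from two narrow modes, mimicking a dataset with two distinct strategies.
    """
    actions = []
    for _ in range(num_candidates):
        mode = rng.choice([-1.0, 1.0])
        actions.append(mode + rng.gauss(0.0, 0.05))
    return actions


def q_value(state, action):
    """Hypothetical stand-in for a learned action evaluation model.

    Both modes (near -1 and near +1) are scored as equally good.
    """
    return -(abs(action) - 1.0) ** 2


def select_action(state, rng, num_candidates=32, temperature=10.0):
    # 1. Draw candidates from the behavior model only.
    candidates = sample_behavior_actions(state, num_candidates, rng)
    # 2. Score each candidate with the evaluation model.
    scores = [temperature * q_value(state, a) for a in candidates]
    # 3. Turn scores into softmax weights (shifted by the max for stability).
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    weights = [w / total for w in weights]
    # 4. Resample one candidate in proportion to its weight.
    action = rng.choices(candidates, weights=weights, k=1)[0]
    return action, candidates, weights


rng = random.Random(0)
action, candidates, weights = select_action(state=None, rng=rng)
```

Because the selected action is one of the sampled candidates, it stays near one of the two modes rather than averaging them, which is the failure a single-mode policy would risk on such data.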

Huayu Chen, Cheng Lu, Chengyang Ying, Hang Su, Jun Zhu • 2022

Related benchmarks

Task	Dataset	Metric	Result	
Offline Reinforcement Learning	D4RL Franka Kitchen	Mixed Success Rate	45.4	43
Offline Reinforcement Learning	D4RL Maze2D	Return (UMaze)	73.9	31
Offline Reinforcement Learning	D4RL AntMaze	Medium Diverse Success Rate	82	27
Multi-Agent Reinforcement Learning	MAMuJoCo HalfCheetah Extreme Env v2 (various)	Average Return	2.76e+3	24
Multi-Agent Reinforcement Learning	MAMuJoCo HalfCheetah Random Env v2 (various)	Average Return	2.40e+3	24
Multi-Agent Reinforcement Learning	MAMuJoCo HalfCheetah Standard Env v2 (various)	Average Return	2.39e+3	24
Multi-Agent Reinforcement Learning	MPE Predator Prey (Expert)	Mean Episode Return	256	19
Multi-Agent Reinforcement Learning	MPE Predator Prey Medium	Mean Episode Return	127	19
Multi-Agent Reinforcement Learning	MPE Predator Prey (Random)	Mean Episode Return	9.3	15
Multi-Agent Reinforcement Learning	MPE Predator Prey (Medium Replay)	Mean Episode Return	11.9	15

Showing 10 of 46 rows
