Model-Based Offline Reinforcement Learning with Pessimism-Modulated Dynamics Belief

About

Model-based offline reinforcement learning (RL) aims to find a highly rewarding policy by leveraging a previously collected static dataset and a dynamics model. Although the dynamics model is itself learned by reusing the static dataset, its generalization ability hopefully promotes policy learning if properly utilized. To that end, several works propose to quantify the uncertainty of the predicted dynamics and explicitly apply it to penalize the reward. However, as the dynamics and the reward are intrinsically different factors in the context of an MDP, characterizing the impact of dynamics uncertainty through a reward penalty may incur an unexpected tradeoff between model utilization and risk avoidance. In this work, we instead maintain a belief distribution over dynamics, and evaluate and optimize the policy through biased sampling from the belief. The sampling procedure, biased towards pessimism, is derived from an alternating Markov game formulation of offline RL. We formally show that the biased sampling naturally induces an updated dynamics belief with a policy-dependent reweighting factor, termed Pessimism-Modulated Dynamics Belief. To improve the policy, we devise an iterative regularized policy optimization algorithm for the game, with a guarantee of monotonic improvement under certain conditions. To make this practical, we further devise an offline RL algorithm that approximately finds the solution. Empirical results show that the proposed approach achieves state-of-the-art performance on a wide range of benchmark tasks.
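The abstract does not spell out the sampling rule, so the following is only a minimal sketch of what "sampling biased towards pessimism" could look like. It assumes (hypothetically) that a handful of next-state candidates are drawn from the dynamics belief and that the candidate with the k-th lowest estimated value is kept. The names `pessimistic_sample`, `N`, and `k`, the toy Gaussian belief, and the stand-in value function are all illustrative assumptions, not taken from the paper.

```python
import numpy as np

def pessimistic_sample(candidates, values, k):
    """Return the candidate whose value is the k-th smallest (k=1 is most pessimistic)."""
    order = np.argsort(values)       # indices sorted from lowest to highest value
    return candidates[order[k - 1]]  # k is 1-indexed

rng = np.random.default_rng(0)

# Hypothetical setup: N next-state candidates from a toy Gaussian "belief".
N, k = 10, 2
candidates = rng.normal(loc=0.0, scale=1.0, size=(N, 3))

# Stand-in value estimate (assumption): any scalar function of the next state.
values = candidates.sum(axis=1)

chosen = pessimistic_sample(candidates, values, k)
```

In this sketch, a smaller `k` leans harder towards low-value outcomes, while `k = N` would always pick the most optimistic candidate.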

Kaiyang Guo, Yunfeng Shao, Yanhui Geng • 2022

Related benchmarks

Task	Dataset	Result
Offline Reinforcement Learning	D4RL halfcheetah-medium-expert	Normalized Score: 108.5	169
Offline Reinforcement Learning	D4RL hopper-medium-expert	Normalized Score: 111.8	161
Offline Reinforcement Learning	D4RL walker2d-medium-expert	Normalized Score: 111.9	132
Offline Reinforcement Learning	D4RL Medium-Replay Hopper	Normalized Score: 106.2	109
Offline Reinforcement Learning	D4RL Medium HalfCheetah	Normalized Score: 75.6	105
Offline Reinforcement Learning	D4RL Medium Walker2d	Normalized Score: 94.2	104
Offline Reinforcement Learning	D4RL walker2d-random	Normalized Score: 21.8	101
Offline Reinforcement Learning	D4RL Medium-Replay HalfCheetah	Normalized Score: 71.7	97
Offline Reinforcement Learning	D4RL halfcheetah-random	Normalized Score: 37.8	94
Offline Reinforcement Learning	D4RL hopper-random	Normalized Score: 32.7	86

Showing 10 of 39 rows
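For reading the Normalized Score column: D4RL scores are conventionally rescaled so that a random policy maps to 0 and an expert policy maps to 100. The helper below is a small sketch of that convention. The function name and the reference returns in the usage line are hypothetical placeholders, not values from D4RL or from this paper.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw episode return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns, chosen only to illustrate the arithmetic.
example = d4rl_normalized_score(raw_return=500.0, random_return=0.0, expert_return=1000.0)
```

Under this convention, scores above 100 (as in several rows of the table) mean the policy's return exceeds the expert reference.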
