Conservative Bayesian Model-Based Value Expansion for Offline Policy Optimization
About
Offline reinforcement learning (RL) addresses the problem of learning a performant policy from a fixed batch of data collected by following some behavior policy. Model-based approaches are particularly appealing in the offline setting since they can extract more learning signal from the logged dataset by learning a model of the environment. However, the performance of existing model-based approaches falls short of that of their model-free counterparts, due to the compounding of estimation errors in the learned model. Driven by this observation, we argue that it is critical for a model-based method to understand when to trust the model and when to rely on model-free estimates, and how to act conservatively w.r.t. both. To this end, we derive a simple and elegant methodology called conservative Bayesian model-based value expansion for offline policy optimization (CBOP), which trades off model-free and model-based estimates during the policy evaluation step according to their epistemic uncertainties, and enforces conservatism by taking a lower bound on the Bayesian posterior value estimate. On the standard D4RL continuous control tasks, our method significantly outperforms previous model-based approaches: e.g., MOPO by $116.4$%, MOReL by $23.2$%, and COMBO by $23.7$%. Further, CBOP achieves state-of-the-art performance on $11$ out of $18$ benchmark datasets while performing on par on the remaining datasets.
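The core idea — fusing the model-free bootstrap with multi-horizon model-based rollout targets by their epistemic uncertainties, then taking a lower bound on the resulting posterior — can be illustrated with a minimal sketch. Note this is an assumption-laden simplification of the paper's method: it treats each h-step value-expansion target as a Gaussian observation, combines them by inverse-variance weighting, and returns a lower confidence bound (the function name, shapes, and `lcb_coef` parameter are illustrative, not from the paper).

```python
import numpy as np

def conservative_value_target(estimates, variances, lcb_coef=2.0):
    """Sketch of a conservative Bayesian value target.

    estimates: shape (H+1,) -- h-step value-expansion targets,
               where h = 0 is the purely model-free bootstrap.
    variances: shape (H+1,) -- epistemic variance of each target
               (e.g., from ensembles of dynamics models and critics).

    Each target is treated as a noisy Gaussian observation of the
    true value; inverse-variance weighting yields the posterior,
    so low-uncertainty targets dominate automatically.
    """
    estimates = np.asarray(estimates, dtype=float)
    precisions = 1.0 / np.asarray(variances, dtype=float)
    weights = precisions / precisions.sum()     # posterior weights
    post_mean = weights @ estimates             # posterior mean
    post_var = 1.0 / precisions.sum()           # posterior variance
    # Conservatism: bootstrap from a lower confidence bound rather
    # than the posterior mean.
    return post_mean - lcb_coef * np.sqrt(post_var)
```

When the model is unreliable (large rollout variance), the weights collapse onto the model-free target; when the model is trusted, longer rollouts contribute more — which is the adaptive trade-off the abstract describes.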
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 105.4 | 117 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 111.6 | 115 |
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 17.8 | 77 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score | 104.3 | 72 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 32.8 | 70 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score | 74.3 | 59 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score | 66.4 | 59 |
| Offline Reinforcement Learning | D4RL Medium Walker2d | Normalized Score | 95.5 | 58 |
| Offline Reinforcement Learning | D4RL walker2d medium-replay | Normalized Score | 92.7 | 45 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper-mr v2 (medium-replay) | Avg Normalized Score | 104.3 | 29 |