CROP: Conservative Reward for Model-based Offline Policy Optimization

About

Offline reinforcement learning (RL) aims to optimize a policy using collected data without online interactions. Model-based approaches are particularly appealing for offline RL because they can mitigate limited data coverage by generating additional data with learned models. Nonetheless, a prevalent issue in offline RL is the overestimation caused by distribution shift. This study proposes a novel model-based offline RL algorithm named Conservative Reward for model-based Offline Policy optimization (CROP). CROP introduces a streamlined objective that concurrently minimizes the estimation error and the rewards of random actions, thereby yielding a robustly conservative reward estimator. Theoretical analysis shows that the designed conservative reward mechanism leads to a conservative policy evaluation and mitigates distribution shift. Experiments show that, with this simple modification to reward estimation, CROP can conservatively estimate the reward and achieve performance competitive with existing methods. The source code will be available after acceptance.
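The objective described above (fit the observed rewards while pushing down the predicted rewards of random actions) can be illustrated with a minimal sketch. This is not the authors' implementation: the linear reward model, the weight `beta`, the uniform sampling of random actions, and all function names are assumptions made for illustration only.

```python
import numpy as np

def features(states, actions):
    # Input to a linear reward model: [state, action, 1] (assumed model class).
    ones = np.ones((states.shape[0], 1))
    return np.concatenate([states, actions, ones], axis=1)

def conservative_loss(theta, X_data, rewards, X_rand, beta):
    # Estimation error on dataset transitions ...
    mse = np.mean((X_data @ theta - rewards) ** 2)
    # ... plus the mean predicted reward of random actions, weighted by beta.
    rand_reward = np.mean(X_rand @ theta)
    return mse + beta * rand_reward

def fit_reward(X_data, rewards, X_rand, beta):
    # Closed-form minimizer for the linear case:
    # (2/n) X^T (X theta - r) + beta * c = 0, with c the mean random-action feature.
    n = X_data.shape[0]
    H = X_data.T @ X_data / n
    g = X_data.T @ rewards / n
    c = X_rand.mean(axis=0)
    return np.linalg.solve(H, g - 0.5 * beta * c)

# Synthetic data: dataset actions cover only a narrow region of the action space.
rng = np.random.default_rng(0)
states = rng.normal(size=(256, 3))
data_actions = rng.normal(loc=0.5, scale=0.1, size=(256, 2))
rewards = states[:, 0] - np.sum(data_actions ** 2, axis=1) + 0.01 * rng.normal(size=256)
random_actions = rng.uniform(-1.0, 1.0, size=(256, 2))  # assumed sampling scheme

X_data = features(states, data_actions)
X_rand = features(states, random_actions)

theta_plain = fit_reward(X_data, rewards, X_rand, beta=0.0)  # ordinary least squares
theta_cons = fit_reward(X_data, rewards, X_rand, beta=1.0)   # conservative variant

print("mean random-action reward, plain:       ", np.mean(X_rand @ theta_plain))
print("mean random-action reward, conservative:", np.mean(X_rand @ theta_cons))
```

In this toy setting, the conservative fit predicts a lower average reward for random actions than the plain fit, at the cost of a somewhat larger error on the dataset. How large `beta` should be is a design choice that this sketch does not address.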

Hao Li, Xiao-Hu Zhou, Shu-Hai Li, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zeng-Guang Hou • 2023

Related benchmarks

Task	Dataset	Result
Offline Reinforcement Learning	D4RL halfcheetah-medium-expert	Normalized Score: 104.9	169
Offline Reinforcement Learning	D4RL hopper-medium-expert	Normalized Score: 107.5	161
Offline Reinforcement Learning	D4RL walker2d-medium-expert	Normalized Score: 111.6	132
Offline Reinforcement Learning	D4RL Medium-Replay Hopper	Normalized Score: 99.4	109
Offline Reinforcement Learning	D4RL Medium HalfCheetah	Normalized Score: 74.1	105
Offline Reinforcement Learning	D4RL Medium Walker2d	Normalized Score: 95.4	104
Offline Reinforcement Learning	D4RL walker2d-random	Normalized Score: 20.9	101
Offline Reinforcement Learning	D4RL Medium-Replay HalfCheetah	Normalized Score: 70.4	97
Offline Reinforcement Learning	D4RL halfcheetah-random	Normalized Score: 33.7	94
Offline Reinforcement Learning	D4RL hopper-random	Normalized Score: 31.8	86

Showing 10 of 12 rows
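For readers unfamiliar with the metric, D4RL normalized scores are conventionally computed by rescaling a policy's raw return so that a random policy scores 0 and an expert policy scores 100. The sketch below assumes the table follows that convention. The reference returns used in the example are hypothetical placeholders, not the actual D4RL reference values.

```python
def normalized_score(raw_return, random_return, expert_return):
    # 0 corresponds to the random-policy return, 100 to the expert return.
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)

# Hypothetical reference returns, chosen only to make the arithmetic easy to follow.
RANDOM_RETURN = 0.0
EXPERT_RETURN = 1000.0

halfway = normalized_score(500.0, RANDOM_RETURN, EXPERT_RETURN)
above_expert = normalized_score(1049.0, RANDOM_RETURN, EXPERT_RETURN)
print(halfway, above_expert)
```

Under this convention a score above 100, such as the medium-expert rows in the table, means the policy's return exceeded the expert reference return.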
