
CROP: Conservative Reward for Model-based Offline Policy Optimization

About

Offline reinforcement learning (RL) aims to optimize a policy from previously collected data, without online interaction. Model-based approaches are particularly appealing for offline RL because they can mitigate limited data coverage by generating additional data with learned models. Nonetheless, a prevalent issue in offline RL is the overestimation caused by distribution shift. This study proposes a novel model-based offline RL algorithm named Conservative Reward for model-based Offline Policy optimization (CROP). CROP introduces a streamlined objective that simultaneously minimizes the estimation error and the rewards of random actions, thereby yielding a robustly conservative reward estimator. Theoretical analysis shows that the designed conservative reward mechanism leads to conservative policy evaluation and mitigates distribution shift. Experiments show that, with this simple modification to reward estimation, CROP estimates rewards conservatively and achieves performance competitive with existing methods. The source code will be available after acceptance.
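
The objective described above (estimation error plus the rewards of random actions) can be illustrated with a minimal sketch. The abstract does not give the exact loss, so everything here is an assumption for illustration: the weighting coefficient `beta`, uniform sampling of random actions within `[action_low, action_high]`, the number of samples `num_random`, and the function name `conservative_reward_loss` are all hypothetical and not taken from the paper.

```python
import random
from typing import Callable, Optional, Sequence, Tuple

# One offline transition: (state, action, observed reward).
Transition = Tuple[Sequence[float], Sequence[float], float]
RewardFn = Callable[[Sequence[float], Sequence[float]], float]


def conservative_reward_loss(
    reward_fn: RewardFn,
    batch: Sequence[Transition],
    beta: float = 0.1,          # assumed trade-off weight (hypothetical)
    num_random: int = 10,       # assumed number of random actions per state
    action_low: float = -1.0,   # assumed action bounds
    action_high: float = 1.0,
    rng: Optional[random.Random] = None,
) -> float:
    """Sketch of a loss that fits dataset rewards while pushing down
    the predicted reward of randomly sampled actions."""
    rng = rng or random.Random(0)

    # Term 1: mean squared estimation error on dataset (state, action) pairs.
    mse = sum((reward_fn(s, a) - r) ** 2 for s, a, r in batch) / len(batch)

    # Term 2: mean predicted reward of uniformly sampled actions.
    penalty = 0.0
    for state, action, _reward in batch:
        for _sample in range(num_random):
            a_rand = [rng.uniform(action_low, action_high) for _dim in action]
            penalty += reward_fn(state, a_rand)
    penalty /= len(batch) * num_random

    # Minimizing the sum lowers reward estimates away from the data.
    return mse + beta * penalty


if __name__ == "__main__":
    # Toy linear reward model and a two-transition batch (illustrative only).
    def toy_model(s, a):
        return 0.5 * sum(s) + 0.2 * sum(a)

    data = [([0.0, 1.0], [0.3], 0.6), ([1.0, 1.0], [-0.2], 0.9)]
    print(conservative_reward_loss(toy_model, data, beta=0.1))
```

In a real implementation the reward model would typically be a neural network trained by gradient descent on this quantity; the sketch only shows how the two terms combine.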

Hao Li, Xiao-Hu Zhou, Shu-Hai Li, Mei-Jiang Gui, Xiao-Liang Xie, Shi-Qi Liu, Shuang-Yi Wang, Zhen-Qiu Feng, Zeng-Guang Hou • 2023

Related benchmarks

| Task | Dataset | Result (Normalized Score) | Rank |
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | 104.9 | 155 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | 107.5 | 153 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | 111.6 | 124 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | 74.1 | 97 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | 99.4 | 97 |
| Offline Reinforcement Learning | D4RL Medium Walker2d | 95.4 | 96 |
| Offline Reinforcement Learning | D4RL walker2d-random | 20.9 | 93 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | 33.7 | 86 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | 70.4 | 84 |
| Offline Reinforcement Learning | D4RL hopper-random | 31.8 | 78 |
Showing 10 of 12 rows
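
For readers unfamiliar with the "Normalized Score" metric: D4RL conventionally rescales a raw episode return so that a random policy scores 0 and an expert policy scores 100. The sketch below shows that arithmetic. The function name and the reference returns in the usage example are hypothetical placeholders, not values from the benchmark or from this paper.

```python
def normalized_score(raw_return: float,
                     random_return: float,
                     expert_return: float) -> float:
    """Rescale a raw return so random -> 0 and expert -> 100."""
    if expert_return == random_return:
        raise ValueError("expert and random reference returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


if __name__ == "__main__":
    # Hypothetical reference returns, for illustration only.
    print(normalized_score(raw_return=150.0, random_return=0.0, expert_return=200.0))
```

Under this convention, a score above 100 (as in the medium-expert rows above) means the policy's return exceeds the expert reference return.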
