CROP: Conservative Reward for Model-based Offline Policy Optimization
About
Offline reinforcement learning (RL) aims to optimize a policy from previously collected data, without online interaction. Model-based approaches are particularly appealing for offline RL because a learned model can generate additional data, which mitigates the limited coverage of the dataset. Nonetheless, a prevalent issue in offline RL is value overestimation caused by distribution shift. This study proposes a novel model-based offline RL algorithm named Conservative Reward for model-based Offline Policy optimization (CROP). CROP introduces a streamlined objective that simultaneously minimizes the reward estimation error and the estimated rewards of random actions, yielding a robustly conservative reward estimator. Theoretical analysis shows that this conservative reward mechanism leads to conservative policy evaluation and mitigates distribution shift. Experiments show that, with this simple modification to reward estimation, CROP estimates rewards conservatively and achieves performance competitive with existing methods. The source code will be available after acceptance.
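The abstract describes the reward objective only in words, so the following is a minimal sketch of one way such an objective could look, not the authors' implementation. It assumes a squared-error term on dataset transitions plus a weighted penalty on the estimated reward of uniformly sampled actions. The function name `conservative_reward_loss`, the weight `beta`, and the action bounds are all hypothetical choices made for illustration.

```python
import random
from statistics import fmean


def conservative_reward_loss(reward_fn, states, actions, rewards,
                             beta=0.1, action_low=-1.0, action_high=1.0,
                             rng=None):
    """Sketch of a conservative reward objective (assumed form, not from the paper).

    loss = mean squared reward error on dataset transitions
           + beta * mean estimated reward of uniformly random actions
    """
    rng = rng or random.Random()

    # Term 1: estimation error on (state, action, reward) samples from the dataset.
    mse = fmean((reward_fn(s, a) - r) ** 2
                for s, a, r in zip(states, actions, rewards))

    # Term 2: estimated reward of random actions drawn at the same dataset states.
    # Minimizing this term pushes the estimate down for actions the data does not cover.
    random_rewards = []
    for s, a in zip(states, actions):
        random_action = [rng.uniform(action_low, action_high) for _ in a]
        random_rewards.append(reward_fn(s, random_action))

    return mse + beta * fmean(random_rewards)
```

In practice `reward_fn` would be a trainable model and this scalar would be minimized by gradient descent. The uniform sampling distribution and the value of `beta` are assumptions here and would need to be taken from the paper.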
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL halfcheetah-medium-expert | Normalized Score | 104.9 | 155 |
| Offline Reinforcement Learning | D4RL hopper-medium-expert | Normalized Score | 107.5 | 153 |
| Offline Reinforcement Learning | D4RL walker2d-medium-expert | Normalized Score | 111.6 | 124 |
| Offline Reinforcement Learning | D4RL Medium HalfCheetah | Normalized Score | 74.1 | 97 |
| Offline Reinforcement Learning | D4RL Medium-Replay Hopper | Normalized Score | 99.4 | 97 |
| Offline Reinforcement Learning | D4RL Medium Walker2d | Normalized Score | 95.4 | 96 |
| Offline Reinforcement Learning | D4RL walker2d-random | Normalized Score | 20.9 | 93 |
| Offline Reinforcement Learning | D4RL halfcheetah-random | Normalized Score | 33.7 | 86 |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score | 70.4 | 84 |
| Offline Reinforcement Learning | D4RL hopper-random | Normalized Score | 31.8 | 78 |
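For readers unfamiliar with the Normalized Score metric in the table, the sketch below shows the standard D4RL convention, in which a raw episode return is rescaled so that a random policy maps to 0 and an expert policy maps to 100. The function name and the reference returns in the usage lines are hypothetical placeholders, not values taken from D4RL or from this paper.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so that random -> 0 and expert -> 100."""
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
random_ref, expert_ref = 0.0, 4000.0
score = d4rl_normalized_score(3000.0, random_ref, expert_ref)
```

Under this convention a score above 100, as in the medium-expert rows, means the policy's return exceeded the expert reference return.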