Optimistic Critic Reconstruction and Constrained Fine-Tuning for General Offline-to-Online RL

About

Offline-to-online (O2O) reinforcement learning (RL) provides an effective means of leveraging an offline pre-trained policy as initialization to improve performance rapidly with limited online interactions. Recent studies often design fine-tuning strategies for a specific offline RL method and cannot perform general O2O learning from any offline method. To deal with this problem, we show that there are evaluation and improvement mismatches between the offline dataset and the online environment, which hinder the direct application of pre-trained policies to online fine-tuning. In this paper, we propose to handle these two mismatches simultaneously, with the aim of achieving general O2O learning from any offline method to any online method. Before online fine-tuning, we re-evaluate the pessimistic critic trained on the offline dataset in an optimistic way and then calibrate the misaligned critic with the reliable offline actor to avoid erroneous updates. After obtaining an optimistic and aligned critic, we perform constrained fine-tuning to combat distribution shift during online learning. We show empirically that the proposed method achieves stable and efficient performance improvement on multiple simulated tasks compared with state-of-the-art methods.
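One way to picture the "optimistic re-evaluation" step is as ordinary policy evaluation of the frozen offline actor on the logged data, with no pessimism penalty in the target. The sketch below is a tabular toy under that assumption and is not the paper's algorithm: the function name `evaluate_policy`, the transition format, and the two-step chain are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict


def evaluate_policy(transitions, policy, gamma=0.99, lr=0.5, sweeps=200):
    """Plain TD(0) evaluation of a fixed policy on logged transitions.

    transitions: list of (state, action, reward, next_state, done) tuples.
    policy: dict mapping state -> action (stands in for the frozen offline actor).
    The target bootstraps from Q at the policy's own next action and applies
    no pessimism penalty; this is an illustrative assumption.
    """
    q = defaultdict(float)
    for _ in range(sweeps):
        for s, a, r, s_next, done in transitions:
            # Terminal transitions do not bootstrap.
            bootstrap = 0.0 if done else q[(s_next, policy[s_next])]
            target = r + gamma * bootstrap
            q[(s, a)] += lr * (target - q[(s, a)])
    return dict(q)


# Hypothetical toy chain: state 0 -> state 1 -> terminal, reward 1 at the end.
toy_data = [
    (0, "right", 0.0, 1, False),
    (1, "right", 1.0, 2, True),
]
toy_policy = {0: "right", 1: "right"}
q_values = evaluate_policy(toy_data, toy_policy, gamma=0.9)
```

In this toy, the estimate for the last step approaches the reward of 1.0 and the estimate for the first step approaches the discounted value 0.9. The calibration and constrained fine-tuning stages described above are not covered by this sketch.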

Qin-Wen Luo, Ming-Kun Xie, Ye-Wen Wang, Sheng-Jun Huang • 2024

Related benchmarks

Task                 Dataset                         Result                               Rank
hopper locomotion    D4RL hopper medium-replay       Normalized Score: 105.1              56
walker2d locomotion  D4RL walker2d medium-replay     --                                   53
Locomotion           D4RL walker2d-medium-expert     Normalized Score: 121.4              47
Locomotion           D4RL Halfcheetah medium         --                                   44
Locomotion           D4RL Walker2d medium            --                                   44
Locomotion           D4RL halfcheetah-medium-expert  Normalized Score: 100.4              37
Locomotion           D4RL HalfCheetah Medium-Replay  Normalized Score: 0.7165             33
Locomotion           D4RL hopper-medium-expert       Normalized Score (100k Steps): 107.6 18
Locomotion           D4RL Hopper medium              Normalized Score: 100.4              14
Navigation           D4RL AntMaze umaze v2           Initial D4RL Score: 133.7            12
Showing 10 of 18 rows
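
For readers unfamiliar with the "Normalized Score" metric, D4RL conventionally rescales a raw episode return so that a random policy maps to 0 and an expert policy maps to 100. The helper below is a minimal sketch of that formula; the name `d4rl_normalized_score` and the reference returns in the usage line are hypothetical placeholders, not values taken from any listed benchmark.

```python
def d4rl_normalized_score(raw_return, random_return, expert_return):
    """Rescale a raw return so random -> 0 and expert -> 100.

    The reference returns are per-environment constants; they are passed in
    explicitly here because this sketch does not ship any real values.
    """
    if expert_return == random_return:
        raise ValueError("expert and random reference returns must differ")
    return 100.0 * (raw_return - random_return) / (expert_return - random_return)


# Hypothetical reference returns, for illustration only.
example_score = d4rl_normalized_score(raw_return=1051.0,
                                      random_return=0.0,
                                      expert_return=1000.0)
```

Under this convention, a score above 100 means the evaluated policy collected a higher return than the expert reference.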
