Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning

About

A compelling use case of offline reinforcement learning (RL) is to obtain a policy initialization from existing datasets followed by fast online fine-tuning with limited interaction. However, existing offline RL methods tend to behave poorly during fine-tuning. In this paper, we devise an approach for learning an effective initialization from offline data that also enables fast online fine-tuning capabilities. Our approach, calibrated Q-learning (Cal-QL), accomplishes this by learning a conservative value function initialization that underestimates the value of the learned policy from offline data, while also being calibrated, in the sense that the learned Q-values are at a reasonable scale. We refer to this property as calibration, and define it formally as providing a lower bound on the true value function of the learned policy and an upper bound on the value of some other (suboptimal) reference policy, which may simply be the behavior policy. We show that offline RL algorithms that learn such calibrated value functions lead to effective online fine-tuning, enabling us to take the benefits of offline initializations in online fine-tuning. In practice, Cal-QL can be implemented on top of conservative Q-learning (CQL) for offline RL with a one-line code change. Empirically, Cal-QL outperforms state-of-the-art methods on 9/11 fine-tuning benchmark tasks that we study in this paper. Code and video are available at https://nakamotoo.github.io/Cal-QL.
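The abstract does not spell out the one-line change, so the following is only a minimal sketch of one plausible reading: a CQL-style penalty pushes down the Q-values of policy-sampled actions, and the calibrated variant clips those Q-values from below at the reference policy's value before they enter the penalty. The function name `conservative_penalty`, its parameters, and the use of a plain mean over sampled actions are assumptions made for illustration, not the authors' released code.

```python
def conservative_penalty(q_policy_actions, q_dataset_action, v_reference=None):
    """Illustrative CQL-style regularizer for a single state (hypothetical helper).

    q_policy_actions: Q-value estimates for actions sampled from the learned policy.
    q_dataset_action: Q-value estimate for the action stored in the offline dataset.
    v_reference:      estimated value of a reference policy (e.g. the behavior
                      policy) at this state; None gives the uncalibrated penalty.
    """
    if v_reference is not None:
        # Assumed calibration step: Q-values already below the reference value
        # are replaced by it, so the penalty stops pushing them further down.
        q_policy_actions = [max(q, v_reference) for q in q_policy_actions]
    # Term that is minimized, pushing down Q on policy-sampled actions.
    pushed_down = sum(q_policy_actions) / len(q_policy_actions)
    # Dataset-action term enters with a negative sign, so minimizing pushes it up.
    return pushed_down - q_dataset_action


# Usage: compare the two penalties on the same toy Q-values.
plain = conservative_penalty([1.0, 3.0], 2.0)
calibrated = conservative_penalty([1.0, 3.0], 2.0, v_reference=2.0)
```

In this sketch the clipped entries are constant with respect to the Q-function, which is how the learned values would stay above the reference policy's value while remaining conservative elsewhere.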

Mitsuhiko Nakamoto, Yuexiang Zhai, Anikait Singh, Max Sobol Mark, Yi Ma, Chelsea Finn, Aviral Kumar, Sergey Levine • 2023

Related benchmarks

Task | Dataset | Result | Rank
Locomotion | D4RL Halfcheetah medium | -- | 44
Offline Reinforcement Learning | D4RL (various) | HalfCheetah-Medium: 47.6 | 16
Online Fine-tuning | D4RL MuJoCo and Maze2D online fine-tuning v2 v0 | Normalized Return: 98 | 14
Locomotion | D4RL Hopper medium | Normalized Score: 92.83 | 14
Goal-conditioned navigation | D4RL AntMaze | Score: 0.89 | 12
Navigation | D4RL AntMaze umaze v2 | Initial D4RL Score: 104.6 | 12
Goal-conditioned manipulation | OGBench puzzle-4x4-play | Score: 20 | 12
Goal-conditioned navigation | OGBench antmaze-giant-navigate | Score: 2 | 12
Goal-conditioned manipulation | OGBench cube-double-play | Score: 2 | 12
Goal-conditioned navigation | OGBench humanoidmaze-large-navigate | Score: 0.00e+0 | 12
Showing 10 of 21 rows
