Adaptive Policy Selection and Fine-Tuning under Interaction Budgets for Offline-to-Online Reinforcement Learning

About

In offline-to-online reinforcement learning (O2O-RL), policies are first trained safely offline on previously collected datasets and then fine-tuned for the target task through limited online interaction. In a typical O2O-RL pipeline, candidate policies trained with offline RL are evaluated via either off-policy evaluation (OPE) or online evaluation (OE). The policy with the highest estimated value is then deployed and continually fine-tuned. This setup has two main issues. First, OPE can be unreliable, which makes it risky to deploy a policy based solely on its estimates, whereas OE can identify a viable policy only at the cost of substantial online interaction that could otherwise have been used for fine-tuning. Second, and more importantly, it is often not possible to determine a priori whether a pretrained policy will improve with post-deployment fine-tuning, especially in non-stationary environments. As a result, procedures that commit to a single deployed policy are impractical in many real-world settings. A naive remedy that exhaustively fine-tunes all candidates would violate interaction budget constraints and is likewise infeasible. In this paper, we propose a novel adaptive approach for policy selection and fine-tuning under online interaction budgets in O2O-RL. Following the standard pipeline, we first train a set of candidate policies with different offline RL algorithms and hyperparameters, and then perform OPE to obtain initial performance estimates. Next, we adaptively select and fine-tune the policies based on their predicted performance via an upper-confidence-bound approach, thereby making efficient use of online interactions. We demonstrate that our approach improves upon O2O-RL baselines on various benchmarks.
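
The selection loop described in the abstract can be illustrated with a minimal sketch. This is not the authors' algorithm: it is a plain UCB1-style bandit, assumed here for illustration, in which each candidate policy is an arm, its OPE estimate seeds the running mean, and each unit of the interaction budget is spent fine-tuning whichever candidate currently has the highest upper confidence bound. The function name `ucb_select_and_finetune`, the callback `finetune_step`, and the parameters `c` and `prior_weight` are all hypothetical.

```python
import math


def ucb_select_and_finetune(ope_estimates, finetune_step, budget,
                            c=1.0, prior_weight=1.0):
    """Spend `budget` online episodes across candidates with a UCB rule.

    ope_estimates: initial value estimates, one per candidate policy.
    finetune_step: hypothetical callback; finetune_step(i) runs one online
        fine-tuning episode for candidate i and returns the observed return.
    c: exploration coefficient (assumed, not taken from the paper).
    prior_weight: pseudo-observation count given to each OPE estimate.
    """
    n = len(ope_estimates)
    means = list(ope_estimates)   # OPE estimates act as prior means
    counts = [prior_weight] * n   # pseudo-counts backing those priors
    pulls = [0] * n               # real online episodes spent per candidate

    for _ in range(budget):
        total = sum(counts)
        # Upper confidence bound: estimated value plus an exploration bonus
        # that shrinks as a candidate accumulates observations.
        scores = [
            means[i] + c * math.sqrt(math.log(total + 1) / counts[i])
            for i in range(n)
        ]
        i = max(range(n), key=scores.__getitem__)

        ret = finetune_step(i)    # one budget unit: fine-tune and observe
        counts[i] += 1
        means[i] += (ret - means[i]) / counts[i]   # incremental mean update
        pulls[i] += 1

    best = max(range(n), key=means.__getitem__)
    return best, means, pulls
```

Because fine-tuning changes a policy's return over time, a running mean over all past episodes is a simplification; a recency-weighted estimate or a predictor of future performance would be a closer fit to the "predicted performance" the abstract refers to.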

Alper Kamil Bozkurt, Xiaoan Xu, Shangtong Zhang, Miroslav Pajic, Yuichi Motai • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Offline Reinforcement Learning | D4RL Medium-Replay HalfCheetah | Normalized Score: 95.8 | 97 |
| Locomotion | D4RL walker2d-medium-expert | Normalized Score: 96.7 | 90 |
| walker2d locomotion | D4RL walker2d medium-replay | Normalized Score: 92.1 | 78 |
| hopper locomotion | D4RL hopper medium-replay | Normalized Score: 80.3 | 71 |
| Locomotion | D4RL Walker2d medium | -- | 70 |
| hopper locomotion | D4RL hopper-medium-expert | Normalized Score: 75.4 | 53 |
| Offline Reinforcement Learning | D4RL hopper-random | Mean Normalized Score: 62.1 | 21 |
| Locomotion | D4RL Cheetah Medium | Mean Return: 91.8 | 17 |
| Reinforcement Learning | D4RL Ant Medium | D4RL Score: 82.3 | 7 |
| Locomotion | D4RL hopper-random | Mean Return: 63.3 | 5 |
Showing 10 of 30 rows
