RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

About

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse-reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.
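
The abstract does not spell out the terms of the ranking loss, so the following is only a minimal sketch of the general idea of adding a ranking term to a temporal-difference loss. It uses a generic pairwise hinge loss as a stand-in and should not be read as the RankQ objective itself. All function names, the `margin` and `rank_weight` parameters, and the way preferred and non-preferred actions are paired are assumptions made for illustration.

```python
import numpy as np


def td_loss(q_sa, reward, q_next, done, gamma=0.99):
    """Mean squared one-step TD error for a batch of transitions."""
    target = reward + gamma * (1.0 - done) * q_next
    return float(np.mean((q_sa - target) ** 2))


def pairwise_ranking_loss(q_preferred, q_other, margin=0.1):
    """Hinge loss that reaches zero once q_preferred exceeds q_other by `margin`.

    Assumption: a generic pairwise ranking loss, not the paper's multi-term loss.
    """
    return float(np.mean(np.maximum(0.0, margin - (q_preferred - q_other))))


def combined_loss(q_sa, reward, q_next, done, q_preferred, q_other,
                  rank_weight=1.0, gamma=0.99, margin=0.1):
    """TD loss plus a weighted ranking term (hypothetical combination)."""
    td = td_loss(q_sa, reward, q_next, done, gamma)
    rank = pairwise_ranking_loss(q_preferred, q_other, margin)
    return td + rank_weight * rank


# Toy batch of two transitions; every value here is made up for illustration.
q_sa = np.array([1.0, 2.0])         # Q(s, a) for the dataset actions
reward = np.array([0.0, 1.0])
q_next = np.array([1.0, 0.0])       # bootstrap value at the next state
done = np.array([0.0, 1.0])         # 1.0 marks a terminal transition
q_preferred = np.array([1.0, 0.0])  # Q-values of actions assumed to rank higher
q_other = np.array([0.0, 0.5])      # Q-values of actions assumed to rank lower

loss = combined_loss(q_sa, reward, q_next, done, q_preferred, q_other,
                     rank_weight=2.0, gamma=0.5, margin=0.1)
```

In this sketch the ranking term only penalizes pairs whose ordering is violated or too close, which contrasts with a pessimistic penalty applied uniformly to all unseen actions. How RankQ actually constructs and weights its ranking terms is not stated in the abstract.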

Andrew Choi, Wei Xu • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robotic Manipulation | Adroit Pen | Success Rate (SR) | 96.9 | 13 |
| Reinforcement Learning | antmaze large-play | OSR | 46.7 | 9 |
| Offline-to-Online Reinforcement Learning | D4RL antmaze-medium-play | OSR | 81.7 | 9 |
| Offline-to-Online Reinforcement Learning | D4RL antmaze-medium-diverse | OSR | 78.3 | 9 |
| Reinforcement Learning | antmaze large-diverse | OSR | 21.7 | 9 |
| Robotic Manipulation | Adroit Door | Outcome Success Rate (OSR) | 0.0 | 9 |
| Robotic Manipulation | adroit-relocate | OSR (Success Rate) | 0.0 | 9 |
| Robot Manipulation | vla carrot-onto-plate v1a (low-data) | OSR | 89.7 | 7 |
| Robot Manipulation | vla low-data cube-stacking v1a | OSR | 70.7 | 7 |
| Robot Manipulation | vla spoon-into-bowl v1a (low-data) | OSR | 47.8 | 7 |
Showing 10 of 12 rows
