RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking

About

Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse-reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.
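The abstract describes the objective only at a high level: a temporal-difference term plus a ranking term over actions. The sketch below illustrates that general shape and is not the paper's loss. The pairwise hinge form, the `margin` and `rank_weight` parameters, and the `ranks` preference scores are all assumptions made for illustration; the abstract does not say how RankQ derives its action ordering or what its multiple ranking terms are.

```python
def td_loss(q_sa, reward, q_next, done, gamma=0.99):
    """Mean squared one-step TD error over a batch given as lists of floats."""
    total = 0.0
    for q, r, qn, d in zip(q_sa, reward, q_next, done):
        target = r + gamma * (1.0 - d) * qn  # bootstrap only if not terminal
        total += (q - target) ** 2
    return total / len(q_sa)


def pairwise_ranking_loss(q_values, ranks, margin=0.1):
    """Hinge loss asking Q to respect a given ordering of candidate actions.

    q_values[i] stands for Q(s, a_i). ranks[i] is a HYPOTHETICAL preference
    score (higher means preferred); how such scores would be produced is an
    assumption of this sketch, not something stated in the abstract.
    """
    total, pairs = 0.0, 0
    for i in range(len(q_values)):
        for j in range(len(q_values)):
            if ranks[i] > ranks[j]:
                # Penalize when the preferred action does not lead by `margin`.
                total += max(0.0, margin - (q_values[i] - q_values[j]))
                pairs += 1
    return total / pairs if pairs else 0.0


def combined_loss(q_sa, reward, q_next, done, cand_q, cand_ranks,
                  rank_weight=1.0):
    """TD term plus a weighted ranking term (rank_weight is an assumed knob)."""
    return (td_loss(q_sa, reward, q_next, done)
            + rank_weight * pairwise_ranking_loss(cand_q, cand_ranks))
```

In this sketch the ranking term is zero whenever the Q-values already order the candidates as `ranks` does with at least `margin` separation, so it constrains only the relative ordering of actions and does not push every unseen action's value down uniformly.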

Andrew Choi, Wei Xu • 2026

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	Adroit Pen	Success Rate (SR) 96.9	13
Reinforcement Learning	antmaze large-play	OSR 46.7	9
Offline-to-Online Reinforcement Learning	D4RL antmaze-medium-play	OSR 81.7	9
Offline-to-Online Reinforcement Learning	D4RL antmaze-medium-diverse	OSR 78.3	9
Reinforcement Learning	antmaze large-diverse	OSR 21.7	9
Robotic Manipulation	Adroit Door	Outcome Success Rate (OSR) 0.00	9
Robotic Manipulation	adroit-relocate	OSR (Success Rate) 0.00	9
Robot Manipulation	vla carrot-onto-plate v1a (low-data)	OSR 89.7	7
Robot Manipulation	vla low-data cube-stacking v1a	OSR 70.7	7
Robot Manipulation	vla spoon-into-bowl v1a (low-data)	OSR 47.8	7

Showing 10 of 12 rows
