RankQ: Offline-to-Online Reinforcement Learning via Self-Supervised Action Ranking
About
Offline-to-online reinforcement learning (RL) improves sample efficiency by leveraging pre-collected datasets prior to online interaction. A key challenge, however, is learning an accurate critic in large state-action spaces with limited dataset coverage. To mitigate harmful updates from value overestimation, prior methods impose pessimism by down-weighting out-of-distribution (OOD) actions relative to dataset actions. While effective, this essentially acts as a behavior cloning anchor and can hinder downstream online policy improvement when dataset actions are suboptimal. We propose RankQ, an offline-to-online Q-learning objective that augments temporal-difference learning with a self-supervised multi-term ranking loss to enforce structured action ordering. By learning relative action preferences rather than uniformly penalizing unseen actions, RankQ shapes the Q-function such that action gradients are directed toward higher-quality behaviors. Across sparse-reward D4RL benchmarks, RankQ achieves performance competitive with or superior to seven prior methods. In vision-based robot learning, RankQ enables effective offline-to-online fine-tuning of a pretrained vision-language-action (VLA) model in a low-data regime, achieving on average a 42.7% higher simulation success rate than the next best method. In a high-data setting, RankQ improves simulation performance by 13.7% over the next best method and achieves strong sim-to-real transfer, increasing real-world cube stacking success from 43.1% to 88.9% relative to the VLA's initial performance.
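To make the idea of augmenting temporal-difference learning with a ranking term concrete, the following is a minimal NumPy sketch and not the authors' implementation. The abstract does not specify the form of RankQ's multi-term ranking loss or how action pairs are chosen, so the hinge-style pairwise term, the function names, and the `margin`, `rank_weight`, and `gamma` parameters here are all illustrative assumptions.

```python
import numpy as np


def td_loss(q_sa, reward, q_next, done, gamma=0.99):
    """Mean squared one-step TD error for a batch of transitions."""
    target = reward + gamma * (1.0 - done) * q_next
    return float(np.mean((q_sa - target) ** 2))


def pairwise_ranking_loss(q_preferred, q_other, margin=0.1):
    """Hinge penalty when a preferred action's Q-value does not exceed
    the other action's Q-value by at least `margin`.

    ASSUMPTION: a generic pairwise hinge stands in for the paper's
    multi-term ranking loss, whose exact form is not given in the abstract.
    """
    return float(np.mean(np.maximum(0.0, margin - (q_preferred - q_other))))


def combined_loss(q_sa, reward, q_next, done, q_preferred, q_other,
                  gamma=0.99, margin=0.1, rank_weight=1.0):
    """TD loss plus a weighted ranking term (hypothetical weighting)."""
    td = td_loss(q_sa, reward, q_next, done, gamma)
    rank = pairwise_ranking_loss(q_preferred, q_other, margin)
    return td + rank_weight * rank


# Toy batch of two transitions with made-up numbers.
q_sa = np.array([1.0, 2.0])
reward = np.array([0.0, 1.0])
q_next = np.array([1.0, 0.0])
done = np.array([0.0, 1.0])
q_preferred = np.array([1.0, 0.0])  # Q-values of actions assumed to rank higher
q_other = np.array([0.0, 0.5])      # Q-values of actions assumed to rank lower

loss = combined_loss(q_sa, reward, q_next, done, q_preferred, q_other,
                     gamma=0.5, margin=0.5, rank_weight=2.0)
```

In this sketch the ranking term is zero for pairs that are already ordered correctly by the margin, so it only pushes on misordered pairs instead of uniformly lowering the values of unseen actions.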
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robotic Manipulation | Adroit Pen | Success Rate (SR) | 96.9 | 13 |
| Reinforcement Learning | antmaze large-play | OSR | 46.7 | 9 |
| Offline-to-Online Reinforcement Learning | D4RL antmaze-medium-play | OSR | 81.7 | 9 |
| Offline-to-Online Reinforcement Learning | D4RL antmaze-medium-diverse | OSR | 78.3 | 9 |
| Reinforcement Learning | antmaze large-diverse | OSR | 21.7 | 9 |
| Robotic Manipulation | Adroit Door | Outcome Success Rate (OSR) | 0.0 | 9 |
| Robotic Manipulation | adroit-relocate | OSR (Success Rate) | 0.0 | 9 |
| Robot Manipulation | vla carrot-onto-plate v1a (low-data) | OSR | 89.7 | 7 |
| Robot Manipulation | vla low-data cube-stacking v1a | OSR | 70.7 | 7 |
| Robot Manipulation | vla spoon-into-bowl v1a (low-data) | OSR | 47.8 | 7 |