Yes, Q-learning Helps Offline In-Context RL

About

Existing offline in-context reinforcement learning (ICRL) methods have predominantly relied on supervised training objectives, which are known to have limitations in offline RL settings. In this study, we explore the integration of RL objectives within an offline ICRL framework. Through experiments on more than 150 GridWorld and MuJoCo environment-derived datasets, we demonstrate that optimizing RL objectives directly improves performance by approximately 30% on average compared to the widely adopted Algorithm Distillation (AD), across various dataset coverages, structures, expertise levels, and environmental complexities. Furthermore, in the challenging XLand-MiniGrid environment, RL objectives doubled the performance of AD. Our results also reveal that adding conservatism during value learning brings further improvements in almost all settings tested. Our findings emphasize the importance of aligning ICRL learning objectives with the RL reward-maximization goal, and demonstrate that offline RL is a promising direction for advancing ICRL.
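
The abstract does not say which RL objective or which form of conservatism is used, so the following is only a generic illustration of the idea, not the authors' implementation. It sketches a one-step Q-learning (TD) loss for discrete actions with a CQL-style conservatism penalty added. The function name `td_loss_with_conservatism`, its parameters (`gamma`, `alpha`), and the choice of a log-sum-exp penalty are all assumptions made for this example.

```python
import numpy as np

def td_loss_with_conservatism(q, q_next, actions, rewards, dones,
                              gamma=0.99, alpha=1.0):
    """Hypothetical sketch: TD loss plus a CQL-style penalty.

    q, q_next : (batch, n_actions) Q-values for current and next states
    actions   : (batch,) integer actions taken in the dataset
    rewards   : (batch,) rewards
    dones     : (batch,) 1.0 where the episode ended, else 0.0
    """
    idx = np.arange(len(actions))
    q_taken = q[idx, actions]

    # One-step Q-learning target; q_next is treated as a fixed target here.
    target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    td = np.mean((q_taken - target) ** 2)

    # Conservatism (assumed CQL-style): push down Q-values over all actions
    # via a numerically stable log-sum-exp, push up Q of dataset actions.
    q_max = q.max(axis=1, keepdims=True)
    lse = q_max[:, 0] + np.log(np.exp(q - q_max).sum(axis=1))
    penalty = np.mean(lse - q_taken)

    return td + alpha * penalty, td, penalty
```

Setting `alpha=0.0` recovers the plain TD loss, which makes it easy to compare a purely reward-maximizing objective against its conservative variant on the same batch.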

Denis Tarasov, Alexander Nikulin, Ilya Zisman, Albina Klepach, Andrei Polubarov, Nikita Lyubaykin, Alexander Derevyagin, Igor Kiselev, Vladislav Kurenkov • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Continuous Control | HCV-25-1 (complete) | NAUC | 90 | 4 |
| Continuous Control | HCV-50-1 (complete) | NAUC | 98 | 4 |
| Continuous Control | HCV-100-1 complete | NAUC | 95 | 4 |
| Continuous Control | ANT 25-1 (complete) | NAUC | 77 | 4 |
| Continuous Control | ANT-50-1 complete | NAUC | 91 | 4 |
| Continuous Control | ANT-100-1 (complete) | NAUC | 1.12 | 4 |
| Continuous Control | HPP-50-1 (complete) | NAUC | 1.01 | 4 |
| Continuous Control | HPP-100-1 (complete) | NAUC | 1.07 | 4 |
| Continuous Control | WLP-25-1 (complete) | NAUC | 1.13 | 4 |
| Continuous Control | WLP-100-1 (complete) | NAUC | 1.12 | 4 |
Showing 10 of 124 rows