Scaling Offline RL via Efficient and Expressive Shortcut Models

About

Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models - a novel class of generative models - to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at nico-espinosadice.github.io/projects/sorl.
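The parallel form of test-time scaling described above, where the learned Q-function acts as a verifier, can be illustrated with a best-of-N selection loop. This is a minimal sketch under stated assumptions, not the authors' implementation: `best_of_n`, `toy_policy`, and `toy_q` are hypothetical names, and the Gaussian sampler and quadratic critic are toy stand-ins for a trained shortcut-model policy and a learned Q-function. Sequential scaling (more sampling steps per action) is not shown.

```python
import numpy as np

def best_of_n(state, sample_action, q_value, n, rng):
    """Draw n candidate actions and return the one the critic scores highest."""
    candidates = [sample_action(state, rng) for _ in range(n)]
    scores = np.array([q_value(state, a) for a in candidates])
    best = int(np.argmax(scores))
    return candidates[best], float(scores[best])

# Hypothetical stand-in for a generative policy: samples a 2-D Gaussian action.
def toy_policy(state, rng):
    return rng.normal(loc=0.0, scale=1.0, size=2)

# Hypothetical stand-in for a learned critic: prefers actions close to the state.
def toy_q(state, action):
    return -float(np.sum((action - state) ** 2))

state = np.array([0.5, -0.5])
action, score = best_of_n(state, toy_policy, toy_q, n=32,
                          rng=np.random.default_rng(0))
```

Raising `n` spends more compute at inference time. Because the maximum is taken over a larger candidate set, the selected action's critic score cannot decrease when the same random stream is extended.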

Nicolas Espinosa-Dice, Yiyi Zhang, Yiding Chen, Bradley Guo, Owen Oertell, Gokul Swamy, Kiante Brantley, Wen Sun • 2025

Related benchmarks

Task | Dataset | Result | Rank
Offline Reinforcement Learning | D4RL AntMaze | AntMaze Medium Play Return: 80.1 | 78
Offline Reinforcement Learning | OGBench | AntMaze Giant Navigate: 12 | 68
Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-expert | Normalized Score: 96.5 | 54
Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-replay | Normalized Score: 0.483 | 47
Offline Reinforcement Learning | D4RL MuJoCo Hopper medium standard | Normalized Score: 81.3 | 47
Offline Reinforcement Learning | D4RL antmaze-large (play) | Normalized Score: 0.573 | 47
Offline Reinforcement Learning | D4RL MuJoCo walker2d-medium-expert | Normalized Score: 109.1 | 47
Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-expert | Normalized Score: 45.9 | 47
Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-replay | Normalized Score: 93 | 42
Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium | Normalized Score: 57.4 | 33
Showing 10 of 23 rows
