Scaling Offline RL via Efficient and Expressive Shortcut Models
About
Diffusion and flow models have emerged as powerful generative approaches capable of modeling diverse and multimodal behavior. However, applying these models to offline reinforcement learning (RL) remains challenging due to the iterative nature of their noise sampling processes, making policy optimization difficult. In this paper, we introduce Scalable Offline Reinforcement Learning (SORL), a new offline RL algorithm that leverages shortcut models, a novel class of generative models, to scale both training and inference. SORL's policy can capture complex data distributions and can be trained simply and efficiently in a one-stage training procedure. At test time, SORL introduces both sequential and parallel inference scaling by using the learned Q-function as a verifier. We demonstrate that SORL achieves strong performance across a range of offline RL tasks and exhibits positive scaling behavior with increased test-time compute. We release the code at nico-espinosadice.github.io/projects/sorl.
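The parallel inference scaling mentioned in the abstract, where the learned Q-function acts as a verifier, can be illustrated with a generic best-of-N selection. The following is a minimal sketch and not the released SORL code: `best_of_n_action`, `toy_policy`, and `toy_q` are hypothetical names, and the toy policy and Q-function are stand-ins for a trained shortcut-model policy and critic.

```python
import numpy as np


def best_of_n_action(state, sample_action, q_value, n_samples, rng):
    """Draw n_samples candidate actions and keep the one the Q-function scores highest."""
    # Each call to sample_action stands in for one independent policy sample.
    candidates = np.stack([sample_action(state, rng) for _ in range(n_samples)])
    # The Q-function is used purely as a verifier: it ranks candidates, nothing more.
    scores = np.array([q_value(state, action) for action in candidates])
    best_index = int(np.argmax(scores))
    return candidates[best_index], scores


# Hypothetical stand-ins (assumptions for illustration only).
def toy_policy(state, rng):
    # Uniform random action with the same shape as the state.
    return rng.uniform(-1.0, 1.0, size=state.shape)


def toy_q(state, action):
    # Higher score when the action is closer to the state vector.
    return -float(np.sum((action - state) ** 2))


rng = np.random.default_rng(0)
state = np.array([0.2, -0.4])
action, scores = best_of_n_action(state, toy_policy, toy_q, n_samples=32, rng=rng)
```

Raising `n_samples` spends more test-time compute on candidate generation; the selected action's Q-value can only stay the same or improve as further candidates are added to the same pool.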
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL AntMaze | AntMaze Medium Play Return | 80.1 | 78 |
| Offline Reinforcement Learning | OGBench | AntMaze Giant Navigate | 12 | 68 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-expert | Normalized Score | 96.5 | 54 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-replay | Normalized Score | 0.483 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper medium standard | Normalized Score | 81.3 | 47 |
| Offline Reinforcement Learning | D4RL antmaze-large (play) | Normalized Score | 0.573 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo walker2d-medium-expert | Normalized Score | 109.1 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-expert | Normalized Score | 45.9 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-replay | Normalized Score | 93 | 42 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium | Normalized Score | 57.4 | 33 |