Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning
About
Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline reinforcement learning.
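The divide-and-conquer idea in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function names, the step-doubling rule (averaging two consecutive half-steps), and the toy single-action setting are all assumptions chosen so that the result can be checked by hand.

```python
import numpy as np


def flow_matching_velocity(noise, action):
    """Velocity of the straight path x_t = (1 - t) * noise + t * action."""
    return action - noise


def bootstrapped_target(step_fn, x, t, d):
    """Build a target for a step of size 2*d from two steps of size d.

    step_fn(x, t, d) is a hypothetical model that returns the average
    velocity for moving from time t to t + d, starting at x.
    """
    v_first = step_fn(x, t, d)
    x_mid = x + d * v_first              # land at time t + d
    v_second = step_fn(x_mid, t + d, d)
    return 0.5 * (v_first + v_second)    # average velocity over [t, t + 2d]


# Toy setting (assumption): the dataset holds a single action, so the marginal
# velocity at (x, t) is known in closed form and points straight at it.
action = np.array([0.8, 0.1])
noise = np.array([0.3, -1.2])


def exact_step(x, t, d):
    # d is unused: on a straight path the average velocity is the same
    # for every step size.
    return (action - x) / (1.0 - t)


# Two half-steps (d = 0.5) bootstrap a target for the full noise-to-action step.
target = bootstrapped_target(exact_step, noise, t=0.0, d=0.5)
one_step_action = noise + 1.0 * target
```

In this toy case the bootstrapped target equals the flow-matching velocity, so a single step from `noise` lands exactly on `action`. In a learned version, the target would typically be held fixed (no gradient) and regressed by a network conditioned on the step size; whether BFQ uses exactly this rule is not stated in the abstract.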
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Offline Reinforcement Learning | D4RL AntMaze | AntMaze Medium Play Return | 87 | 78 |
| Offline Reinforcement Learning | OGBench | AntMaze Giant Navigate | 0.00e+0 | 68 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-expert | Normalized Score | 98.6 | 54 |
| Offline Reinforcement Learning | D4RL MuJoCo Hopper medium standard | Normalized Score | 103.5 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo walker2d-medium-expert | Normalized Score | 113.4 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo halfcheetah-medium-replay | Normalized Score | 0.521 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-expert | Normalized Score | 110.5 | 47 |
| Offline Reinforcement Learning | D4RL antmaze-large (play) | Normalized Score | 0.885 | 47 |
| Offline Reinforcement Learning | D4RL MuJoCo hopper-medium-replay | Normalized Score | 102.1 | 42 |
| Offline Reinforcement Learning | D4RL MuJoCo walker2d-medium | Normalized Score | 91.7 | 33 |