Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

About

Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline reinforcement learning.
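
The divide-and-conquer idea in the abstract can be illustrated with a small toy. The sketch below is not the authors' implementation: it assumes a one-dimensional action, a straight-line flow path from noise to a single hypothetical dataset action (`TARGET_ACTION`), and a closed-form marginal velocity in place of a learned network. All function names are illustrative. It shows only how short-range displacement targets can be chained into a target for a step twice as long, until one step spans the whole path from noise to action; the Q-learning part of BFQ and its actual loss are not shown.

```python
TARGET_ACTION = 2.0  # hypothetical: every flow path in this toy ends here


def interpolate(noise, action, t):
    """Point on the straight flow path from noise (t=0) to action (t=1)."""
    return (1.0 - t) * noise + t * action


def marginal_velocity(x, t):
    """Closed-form marginal velocity for this toy (valid for t < 1)."""
    return (TARGET_ACTION - x) / (1.0 - t)


def short_range_target(x, t, d):
    """Displacement target for a small step d: velocity times step size."""
    return d * marginal_velocity(x, t)


def bootstrap_target(disp_fn, x, t, d):
    """Target for a step of size 2d, built by chaining two steps of size d."""
    first = disp_fn(x, t, d)
    second = disp_fn(x + first, t + d, d)
    return first + second


def build_displacement(levels):
    """Return disp(x, t, d) for d from 2**-levels up to 1 by repeated doubling.

    In a learned version, disp would be a network regressed onto these
    targets; here the recursion itself stands in for the trained model.
    """
    base = 2.0 ** -levels

    def disp(x, t, d):
        if d <= base + 1e-12:
            return short_range_target(x, t, d)
        return bootstrap_target(disp, x, t, d / 2.0)

    return disp


disp = build_displacement(levels=3)
noise = -1.3
# A single call with d = 1 maps the noise sample directly to an action.
one_step_action = noise + disp(noise, 0.0, 1.0)
print(one_step_action)
```

In this toy the velocity is constant along each path, so the chained targets reproduce the full noise-to-action displacement up to floating-point error. With a learned velocity the short-range targets would be approximate, which is presumably why the shortest steps are learned first.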

Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo • 2026

Related benchmarks

Task                            Dataset                                Metric                      Result   Rank
Offline Reinforcement Learning  D4RL AntMaze                           AntMaze Medium Play Return  87       78
Offline Reinforcement Learning  OGBench                                AntMaze Giant Navigate      0.00e+0  68
Offline Reinforcement Learning  D4RL MuJoCo halfcheetah-medium-expert  Normalized Score            98.6     54
Offline Reinforcement Learning  D4RL MuJoCo Hopper medium standard     Normalized Score            103.5    47
Offline Reinforcement Learning  D4RL MuJoCo walker2d-medium-expert     Normalized Score            113.4    47
Offline Reinforcement Learning  D4RL MuJoCo halfcheetah-medium-replay  Normalized Score            0.521    47
Offline Reinforcement Learning  D4RL MuJoCo hopper-medium-expert       Normalized Score            110.5    47
Offline Reinforcement Learning  D4RL antmaze-large (play)              Normalized Score            0.885    47
Offline Reinforcement Learning  D4RL MuJoCo hopper-medium-replay       Normalized Score            102.1    42
Offline Reinforcement Learning  D4RL MuJoCo walker2d-medium            Normalized Score            91.7     33
Showing 10 of 15 rows
