Fast and Highly Expressive Policy Learning for Offline Reinforcement Learning via Bootstrapped Flow Q-Learning

About

Diffusion-based Q-learning has emerged as a powerful paradigm for offline reinforcement learning, but its reliance on multi-step denoising makes both training and inference computationally expensive and brittle. Recent efforts to accelerate diffusion Q-learning toward single-step action generation typically introduce auxiliary networks, policy distillation, or multi-phase training, which frequently compromise simplicity, stability, or performance. To address these limitations, we introduce Bootstrapped Flow Q-Learning (BFQ), a novel framework that enables accurate single-step action generation during both training and inference, without auxiliary networks or distillation procedures. BFQ adopts a divide-and-conquer view of the displacement vector along the flow path: it begins by learning short-range displacements that can be accurately estimated from the Flow Matching marginal velocity, and bootstraps these components to directly learn a noise-to-action mapping in a single step. This formulation eliminates multi-step denoising, resulting in a learning procedure that is substantially faster, simpler, and more robust. Extensive D4RL evaluations show that BFQ improves performance while significantly reducing computational cost compared to multi-step diffusion baselines, demonstrating that single-step action generation suffices for high-performance offline reinforcement learning.
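The divide-and-conquer idea in the abstract can be illustrated with a toy 1-D sketch. This is not the paper's implementation: the function names, the straight-line path x_t = (1 - t) * x0 + t * x1, and the rule of composing two length-d displacements into one length-2d target are all assumptions made for illustration. The toy flow sends every noise sample to one fixed action, so the exact displacement is known in closed form and the composition can be checked by hand.

```python
# Toy 1-D illustration of bootstrapping displacements along a flow path.
# All names here are hypothetical and do not come from the paper.


def short_range_target(x0, x1, d):
    """Approximate displacement over a short interval d.

    On the straight path x_t = (1 - t) * x0 + t * x1 the Flow Matching
    conditional velocity is (x1 - x0), so a short step moves by d * (x1 - x0).
    """
    return d * (x1 - x0)


def bootstrap_target(disp_fn, x_t, t, d):
    """Build a length-2d displacement target from two length-d displacements.

    disp_fn(x, t, d) stands in for the current displacement estimate.
    """
    first = disp_fn(x_t, t, d)
    second = disp_fn(x_t + first, t + d, d)
    return first + second


def make_exact_displacement(action):
    """Exact displacement for a toy flow whose paths all end at `action`."""

    def disp_fn(x, t, d):
        # The marginal velocity (action - x) / (1 - t) is constant along each
        # straight path, so the displacement over d is velocity times d.
        # Requires t < 1.
        return d * (action - x) / (1.0 - t)

    return disp_fn


if __name__ == "__main__":
    noise, action = -1.3, 0.7
    disp = make_exact_displacement(action)
    # Two half-length displacements compose into one full noise-to-action step.
    full_step = bootstrap_target(disp, noise, t=0.0, d=0.5)
    print(noise + full_step)
```

In this toy setting the two half-steps sum exactly to the full displacement, so adding `full_step` to the noise sample lands on the action. In a learned setting `disp_fn` would be a network and the composed value would serve as a regression target, which is the assumption this sketch makes about how the bootstrapping works.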

Thanh Nguyen, Tri Ton, Hongbin Choe, Tung M. Luu, Chang D. Yoo • 2026

Related benchmarks

Task	Dataset	Metric	Result	(unlabeled)
Offline Reinforcement Learning	D4RL AntMaze	AntMaze Medium Play Return	87	78
Offline Reinforcement Learning	OGBench	AntMaze Giant Navigate	0.00e+0	68
Offline Reinforcement Learning	D4RL MuJoCo halfcheetah-medium-expert	Normalized Score	98.6	54
Offline Reinforcement Learning	D4RL MuJoCo Hopper medium standard	Normalized Score	103.5	47
Offline Reinforcement Learning	D4RL MuJoCo walker2d-medium-expert	Normalized Score	113.4	47
Offline Reinforcement Learning	D4RL MuJoCo halfcheetah-medium-replay	Normalized Score	0.521	47
Offline Reinforcement Learning	D4RL MuJoCo hopper-medium-expert	Normalized Score	110.5	47
Offline Reinforcement Learning	D4RL antmaze-large (play)	Normalized Score	0.885	47
Offline Reinforcement Learning	D4RL MuJoCo hopper-medium-replay	Normalized Score	102.1	42
Offline Reinforcement Learning	D4RL MuJoCo walker2d-medium	Normalized Score	91.7	33

Showing 10 of 15 rows
