
One-Step Flow Policy Mirror Descent

About

Diffusion policies have achieved great success in online reinforcement learning (RL) due to their strong expressive capacity. However, the inference of diffusion policy models relies on a slow iterative sampling process, which limits their responsiveness. To overcome this limitation, we propose Flow Policy Mirror Descent (FPMD), an online RL algorithm that enables 1-step sampling during flow policy inference. Our approach exploits a theoretical connection between the distribution variance and the discretization error of single-step sampling in straight interpolation flow matching models, and requires no extra distillation or consistency training. We present two algorithm variants based on rectified flow policy and MeanFlow policy, respectively. Extensive empirical evaluations on MuJoCo and visual DeepMind Control Suite benchmarks demonstrate that our algorithms show strong performance comparable to diffusion policy baselines while requiring orders of magnitude less computational cost during inference.
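The link between distribution variance and one-step discretization error can be illustrated with a toy case. The sketch below is not the FPMD algorithm. It assumes a 1-D Gaussian target N(mu, s^2) and standard normal noise, for which the velocity field of the straight-interpolation flow x_t = (1 - t) x_0 + t x_1 has a closed form. The function names (`velocity`, `euler_sample`, `one_step_rms_error`) are hypothetical and chosen only for this example. In this setting a single Euler step from t = 0 always lands on the target mean mu, while the exact flow ends at mu + s * x_0, so the one-step error scales with the target standard deviation s and vanishes as the target concentrates.

```python
import numpy as np

def velocity(x, t, mu, s):
    """Closed-form marginal velocity E[x1 - x0 | x_t = x] for the toy setting
    x0 ~ N(0, 1), x1 ~ N(mu, s^2) independent, x_t = (1 - t) * x0 + t * x1."""
    var_t = (1.0 - t) ** 2 + (t * s) ** 2   # Var[x_t]
    cov = t * s ** 2 - (1.0 - t)            # Cov[x1 - x0, x_t]
    return mu + cov / var_t * (x - t * mu)

def euler_sample(x0, mu, s, n_steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with forward Euler."""
    x = np.array(x0, dtype=float)
    dt = 1.0 / n_steps
    for k in range(n_steps):
        x = x + dt * velocity(x, k * dt, mu, s)
    return x

def one_step_rms_error(mu, s, n=10_000, seed=0):
    """RMS gap between 1-step Euler sampling and the exact flow endpoint.
    In this toy case the exact ODE solution at t = 1 is mu + s * x0."""
    rng = np.random.default_rng(seed)
    x0 = rng.standard_normal(n)
    exact = mu + s * x0
    approx = euler_sample(x0, mu, s, n_steps=1)
    return float(np.sqrt(np.mean((approx - exact) ** 2)))

if __name__ == "__main__":
    # The one-step error shrinks as the target standard deviation s shrinks.
    for s in (1.0, 0.5, 0.1, 0.0):
        print(f"s={s:4.1f}  one-step RMS error={one_step_rms_error(2.0, s):.4f}")
```

Under these assumptions, sampling with many Euler steps recovers the spread of the target, whereas one step collapses every sample onto the mean. The gap between the two is governed by s, which is the kind of variance-dependent error the abstract refers to.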

Tianyi Chen, Haitong Ma, Na Li, Kai Wang, Bo Dai • 2025

Related benchmarks

Task                Dataset                Result                   Rank
Continuous Control  MuJoCo Ant v4          Average Return: 5.76e+3  46
Continuous Control  MuJoCo Walker2d v4     --                       39
Continuous Control  MuJoCo HalfCheetah v4  Average Return: 1.10e+4  36
Continuous Control  MuJoCo Swimmer v4      Total Reward: 62.2       19
