Flow-Based Policy for Online Reinforcement Learning

About

We present FlowRL, a novel framework for online reinforcement learning that integrates a flow-based policy representation with Wasserstein-2-regularized optimization. We argue that, in addition to training signals, enhancing the expressiveness of the policy class is crucial for performance gains in RL. Flow-based generative models offer this potential, as they excel at capturing complex, multimodal action distributions. However, applying them directly in online RL is challenging because of a fundamental objective mismatch: standard flow training optimizes for imitation of static data, whereas RL requires value-based policy optimization over a dynamically changing replay buffer, which leads to difficult optimization landscapes. FlowRL first models the policy via a state-dependent velocity field and generates actions through deterministic ODE integration from noise. We then derive a constrained policy search objective that jointly maximizes Q through the flow policy while bounding the Wasserstein-2 distance to a behavior-optimal policy implicitly derived from the replay buffer. This formulation aligns the flow optimization with the RL objective, enabling efficient, value-aware policy learning despite the complexity of the policy class. Empirical evaluations on DMControl and HumanoidBench demonstrate that FlowRL achieves competitive performance on online reinforcement learning benchmarks.
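The action-generation step described above (a state-dependent velocity field integrated from noise) can be illustrated with a minimal sketch. This is not the authors' implementation: the tiny MLP, the helper names `init_velocity_params`, `velocity`, and `sample_action`, the forward Euler solver, and the step count are all assumptions chosen for illustration, and the velocity field here is untrained.

```python
import numpy as np

def init_velocity_params(state_dim, action_dim, hidden=32, seed=0):
    """Random weights for a tiny MLP velocity field (hypothetical architecture)."""
    rng = np.random.default_rng(seed)
    in_dim = state_dim + action_dim + 1  # inputs: state, current action, time
    return {
        "W1": rng.normal(0.0, 0.1, (in_dim, hidden)),
        "b1": np.zeros(hidden),
        "W2": rng.normal(0.0, 0.1, (hidden, action_dim)),
        "b2": np.zeros(action_dim),
    }

def velocity(params, state, action, t):
    """State-dependent velocity field v(s, a_t, t)."""
    x = np.concatenate([state, action, [t]])
    h = np.tanh(x @ params["W1"] + params["b1"])
    return h @ params["W2"] + params["b2"]

def sample_action(params, state, noise, n_steps=10):
    """Integrate da/dt = v(s, a, t) from t=0 to t=1 with forward Euler.

    The map from noise to action is deterministic: the only randomness
    is the initial noise sample. The solver and n_steps are assumptions.
    """
    a = noise.copy()
    dt = 1.0 / n_steps
    for k in range(n_steps):
        a = a + dt * velocity(params, state, a, k * dt)
    return a

# Usage: draw noise once, then push it through the flow to get an action.
params = init_velocity_params(state_dim=3, action_dim=2)
state = np.ones(3)
noise = np.random.default_rng(1).normal(size=2)
action = sample_action(params, state, noise)
```

Because the integration is deterministic, the same state and noise always yield the same action, so a critic's value Q(s, a) can in principle be differentiated back through the integration steps to the velocity-field parameters. In practice the velocity field would be a trained network in an autodiff framework rather than fixed NumPy weights.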

Lei Lv, Yunfei Li, Yu Luo, Fuchun Sun, Tao Kong, Jiafeng Xu, Xiao Ma • 2025

Related benchmarks

Task	Dataset	Result
Continuous Control	MuJoCo Walker2d v4	--	51
Continuous Control	MuJoCo Ant v4	Average Return: 5.51e+3	46
Continuous Control	MuJoCo HalfCheetah v4	Average Return: 8.96e+3	36
Continuous Control	MuJoCo Swimmer v4	Total Reward: 48.7	19
Locomotion	DeepMind Control suite Dog-Trot	Final Return: 916	17
Dog-stand	DeepMind Control Suite (DMC)	Total Average Return: 977	13
H1balance_simple	Humanoid-bench	Total Return: 314	13
H1crawl	Humanoid-bench	Total Average Return: 884	13
H1reach	Humanoid-bench	Total Average Return: 5.18e+3	13
H1sit_hard	Humanoid-bench	Total Average Return: 728	13

Showing 10 of 21 rows
