
Evolving Diffusion and Flow Matching Policies for Online Reinforcement Learning

About

Diffusion and flow matching policies offer expressive, multimodal action modeling, yet they are frequently unstable in online reinforcement learning (RL) due to intractable likelihoods and gradients propagating through long sampling chains. Conversely, tractable parameterizations such as Gaussians lack the expressiveness needed for complex control, exposing a persistent tension between optimization stability and representational power. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithm-agnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Using a two-timescale alternating schedule and anchoring decoder refinement to a fixed prior, GoRL enables stable optimization while continuously expanding expressiveness. Empirically, GoRL consistently outperforms unimodal and generative baselines across diverse continuous-control tasks. Notably, on the challenging HopperStand task, it achieves episodic returns exceeding 870, more than $3\times$ that of the strongest baseline, demonstrating a practical path to policies that are both stable and highly expressive. Our code is publicly available at https://github.com/bennidict23/GoRL.

Chubin Zhang, Zhenglin Wan, Feng Chen, Fuchao Yang, Lang Feng, Yaxin Zhou, Xingrui Yu, Yang You, Ivor Tsang, Bo An • 2025
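
The structural idea in the abstract (optimize a tractable policy in latent space, let a separate decoder produce actions, and alternate the two on different timescales) can be illustrated with a rough sketch. This is not the authors' implementation: the class names, the linear-tanh `ToyDecoder` (a stand-in for a diffusion or flow matching decoder), the REINFORCE-style update, the toy reward, and the no-op `refine_decoder` stub are all assumptions made purely for illustration.

```python
import numpy as np

STATE_DIM, LATENT_DIM, ACTION_DIM = 3, 2, 2  # hypothetical sizes


class LatentGaussianPolicy:
    """Tractable policy over latents: z ~ N(W s, std^2 I)."""

    def __init__(self, rng):
        self.rng = rng
        self.W = np.zeros((LATENT_DIM, STATE_DIM))
        self.std = 1.0  # kept fixed to keep the sketch short

    def sample(self, state):
        return self.W @ state + self.std * self.rng.standard_normal(LATENT_DIM)

    def log_prob(self, state, z):
        diff = (z - self.W @ state) / self.std
        norm = LATENT_DIM * np.log(self.std * np.sqrt(2 * np.pi))
        return float(-0.5 * diff @ diff - norm)

    def grad_log_prob(self, state, z):
        # Gradient of log N(z; W s, std^2 I) with respect to W.
        return np.outer((z - self.W @ state) / self.std**2, state)


class ToyDecoder:
    """Stand-in decoder a = tanh(A z + B s); NOT a diffusion/flow model."""

    def __init__(self, rng):
        self.A = 0.5 * rng.standard_normal((ACTION_DIM, LATENT_DIM))
        self.B = 0.5 * rng.standard_normal((ACTION_DIM, STATE_DIM))

    def decode(self, state, z):
        return np.tanh(self.A @ z + self.B @ state)


def toy_reward(action):
    # Hypothetical reward: prefer actions near 0.5 in every dimension.
    return -float(np.sum((action - 0.5) ** 2))


def refine_decoder(decoder, prior_latents):
    # Placeholder: the real decoder objective is not given in the abstract,
    # so this stub leaves the decoder unchanged and only reports batch size.
    return len(prior_latents)


def train(outer_rounds=3, policy_steps=50, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    policy, decoder = LatentGaussianPolicy(rng), ToyDecoder(rng)
    phases = []
    for _ in range(outer_rounds):
        # Fast timescale: decoder frozen, likelihood-based update in latent space.
        for _ in range(policy_steps):
            state = rng.standard_normal(STATE_DIM)
            z = policy.sample(state)
            reward = toy_reward(decoder.decode(state, z))
            policy.W += lr * reward * policy.grad_log_prob(state, z)
        phases.append("policy")
        # Slow timescale: decoder refinement with latents from a fixed N(0, I) prior.
        refine_decoder(decoder, rng.standard_normal((8, LATENT_DIM)))
        phases.append("decoder")
    return policy, decoder, phases
```

The point of the sketch is that the policy gradient only ever needs `log_prob` of a Gaussian over latents, so no gradient flows through the decoder's sampling procedure; the decoder can then be swapped for a more expressive generative model without changing the policy update.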

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Online Reinforcement Learning | CheetahRun DMControl (final) | Normalized Return 902.2 | 5 |
| Online Reinforcement Learning | DMControl FingerSpin (final) | Normalized Return 903.9 | 5 |
| Online Reinforcement Learning | FingerTurnHard DMControl (final) | Normalized Return 884.6 | 5 |
| Online Reinforcement Learning | FishSwim DMControl (final) | Normalized Return 641 | 5 |
| Online Reinforcement Learning | HopperStand DMControl (final) | Normalized Return 874.6 | 5 |
| Online Reinforcement Learning | WalkerWalk DMControl (final) | Normalized Return 919.6 | 5 |
