Evolving Diffusion and Flow Matching Policies for Online Reinforcement Learning
About
Diffusion and flow matching policies offer expressive, multimodal action modeling, yet they are frequently unstable in online reinforcement learning (RL) due to intractable likelihoods and gradients propagating through long sampling chains. Conversely, tractable parameterizations such as Gaussians lack the expressiveness needed for complex control -- exposing a persistent tension between optimization stability and representational power. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithm-agnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Using a two-timescale alternating schedule and anchoring decoder refinement to a fixed prior, GoRL enables stable optimization while continuously expanding expressiveness. Empirically, GoRL consistently outperforms unimodal and generative baselines across diverse continuous-control tasks. Notably, on the challenging HopperStand task, it achieves episodic returns exceeding 870 -- more than $3\times$ that of the strongest baseline -- demonstrating a practical path to policies that are both stable and highly expressive. Our code is publicly available at https://github.com/bennidict23/GoRL.
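The decoupling described above can be illustrated with a minimal structural sketch. It is not the GoRL implementation: all names (`LatentGaussianPolicy`, `decode`, `alternating_schedule`) are hypothetical, the policy is state-independent for brevity, and the decoder is a placeholder `tanh` map standing in for a state-conditioned diffusion or flow model. The step counts in the schedule are assumptions chosen only to show the alternation.

```python
import numpy as np


class LatentGaussianPolicy:
    """Hypothetical tractable policy over latents z (state-independent for brevity)."""

    def __init__(self, latent_dim):
        self.mean = np.zeros(latent_dim)
        self.log_std = np.zeros(latent_dim)

    def sample(self, rng, n):
        noise = rng.standard_normal((n, self.mean.size))
        return self.mean + np.exp(self.log_std) * noise

    def log_prob(self, z):
        # Closed-form diagonal-Gaussian log-density: this is what keeps
        # likelihood-based policy updates tractable in latent space.
        var = np.exp(2.0 * self.log_std)
        per_dim = (z - self.mean) ** 2 / var + 2.0 * self.log_std + np.log(2.0 * np.pi)
        return -0.5 * per_dim.sum(axis=-1)


def decode(z, weights):
    """Placeholder decoder (assumption): a real one would be a generative model."""
    return np.tanh(z @ weights)


def alternating_schedule(num_rounds, policy_steps, decoder_steps):
    """Yield (phase, round) pairs: a block of policy updates, then decoder updates."""
    for r in range(num_rounds):
        for _ in range(policy_steps):
            yield "policy", r   # decoder frozen, latent policy updated
        for _ in range(decoder_steps):
            yield "decoder", r  # latent policy frozen, decoder refined


rng = np.random.default_rng(0)
policy = LatentGaussianPolicy(latent_dim=2)
weights = rng.standard_normal((2, 3))  # maps 2-D latents to 3-D actions
z = policy.sample(rng, 4)              # optimization happens over z
actions = decode(z, weights)           # generation happens in the decoder
phases = [phase for phase, _ in alternating_schedule(2, 3, 1)]
```

The point of the sketch is the division of labor: any RL algorithm that needs `log_prob` only ever touches the latent Gaussian, while the mapping from latents to actions can be made arbitrarily expressive without entering the policy-gradient computation.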
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Online Reinforcement Learning | CheetahRun DMControl (final) | Normalized Return | 902.2 | 5 |
| Online Reinforcement Learning | DMControl FingerSpin (final) | Normalized Return | 903.9 | 5 |
| Online Reinforcement Learning | FingerTurnHard DMControl (final) | Normalized Return | 884.6 | 5 |
| Online Reinforcement Learning | FishSwim DMControl (final) | Normalized Return | 641 | 5 |
| Online Reinforcement Learning | HopperStand DMControl (final) | Normalized Return | 874.6 | 5 |
| Online Reinforcement Learning | WalkerWalk DMControl (final) | Normalized Return | 919.6 | 5 |