Training Diffusion Policies via Prior-Mapping Co-Evolution

About

Reinforcement learning (RL) faces a persistent tension: policies that are stable to optimize (e.g., Gaussians) are often too simple to represent the multimodal action distributions required for complex control. Conversely, expressive generative policies, such as diffusion and flow matching, can be difficult to optimize in online RL due to intractable likelihoods and gradients propagating through long sampling chains. We address this tension with a key structural principle: decoupling optimization from generation. Building on this, we introduce GoRL (Generative Online Reinforcement Learning), an algorithm-agnostic framework that trains expressive policies from scratch by confining policy optimization to a tractable latent space while delegating action synthesis to a conditional generative decoder. Viewed as prior-mapping co-evolution, each stage first improves a tractable latent prior through RL and then consolidates the resulting behavior into a more expressive prior-to-action mapping. This two-timescale schedule, anchored by fixed-prior decoder refinement, enables stable optimization while continuously expanding expressiveness. Empirically, GoRL consistently outperforms unimodal and generative baselines across diverse continuous-control tasks. Notably, GoRL achieves returns exceeding 870 on HopperStand, more than 3× the strongest baseline; on high-dimensional humanoid tasks, it further outperforms the strongest non-GoRL baseline by over an order of magnitude.
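The two-timescale schedule described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a hypothetical 1-D task whose reward peaks at a = 2, a linear decoder a = w*z + b standing in for the conditional generative decoder, a plain REINFORCE update standing in for the RL algorithm, and moment matching standing in for diffusion or flow-matching training. Resetting the latent policy to the fixed prior at the start of each stage is also an assumption. All names (reward, coevolve_sketch) are illustrative.

```python
import numpy as np


def reward(a):
    # Hypothetical 1-D task (an assumption): reward peaks at a = 2.
    return -(a - 2.0) ** 2


def coevolve_sketch(num_stages=5, rl_steps=50, batch=256, lr=0.05,
                    sigma=1.0, seed=0):
    """Toy prior-mapping co-evolution loop under the assumptions above."""
    rng = np.random.default_rng(seed)
    w, b = 1.0, 0.0  # linear decoder: a = w * z + b

    for _ in range(num_stages):
        # Assumption: each stage starts from the fixed prior N(0, sigma^2).
        mu = 0.0

        # Fast timescale: improve the latent Gaussian with the decoder frozen.
        for _ in range(rl_steps):
            z = mu + sigma * rng.standard_normal(batch)
            r = reward(w * z + b)        # reward of the decoded actions
            adv = r - r.mean()           # mean baseline for variance reduction
            # REINFORCE gradient of the expected reward w.r.t. the latent mean.
            grad_mu = np.mean(adv * (z - mu) / sigma ** 2)
            mu += lr * grad_mu

        # Slow timescale: consolidate the improved behavior into the decoder.
        z_policy = mu + sigma * rng.standard_normal(batch)
        actions = w * z_policy + b       # behavior of the improved policy
        # Refit the decoder so that samples from the FIXED prior reproduce
        # these actions (moment matching as a stand-in for generative training).
        b = actions.mean()
        w = actions.std() / sigma

    return w, b


if __name__ == "__main__":
    w, b = coevolve_sketch()
    print(f"decoder scale w = {w:.3f}, decoder offset b = {b:.3f}")
```

In this sketch the RL step only ever touches the latent mean, so its likelihood stays tractable, while the decoder absorbs each stage's improvement. After refinement, the unshifted prior already decodes to the improved behavior.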

Chubin Zhang, Zhenglin Wan, Feng Chen, Fuchao Yang, Lang Feng, Yaxin Zhou, Xingrui Yu, Yang You, Ivor Tsang, Bo An • 2025

Related benchmarks

Task	Dataset	Result
Online Reinforcement Learning	CheetahRun DMControl (final)	Normalized Return: 902.2	5
Online Reinforcement Learning	DMControl FingerSpin (final)	Normalized Return: 903.9	5
Online Reinforcement Learning	FingerTurnHard DMControl (final)	Normalized Return: 884.6	5
Online Reinforcement Learning	FishSwim DMControl (final)	Normalized Return: 641	5
Online Reinforcement Learning	HopperStand DMControl (final)	Normalized Return: 874.6	5
Online Reinforcement Learning	WalkerWalk DMControl (final)	Normalized Return: 919.6	5

