GeMPO: Generalized Measure Matching for Online Diffusion Reinforcement Learning

About

A commonly used family of RL algorithms for diffusion policies applies softmax reweighting to samples from the behavior policy, which often induces an overly greedy policy and fails to use feedback from negative samples. In this work, we introduce GeMPO, a simple and unified framework that generalizes the reweighting scheme in diffusion RL from softmax to general monotonic functions. GeMPO revisits diffusion RL from a measure matching perspective: first, we construct a virtual target policy measure by solving a regularized policy optimization objective; second, we minimize the divergence between the current policy and this target measure through reweighted flow matching. This formulation offers two key advantages: i) it extends weight design beyond traditional exponential reweighting, allowing the weights to be tailored to diverse reward landscapes; and ii) by relaxing the non-negativity constraint on the target measure, it provides a principled justification for negative reweighting. We provide interpretations of how negative reweighting actively repels the policy from suboptimal actions and thus facilitates exploration. Extensive empirical evaluations demonstrate that GeMPO achieves competitive or superior performance by leveraging these flexible weighting schemes, and we provide practical guidelines for selecting a reweighting method.
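
To make the contrast between exponential and general monotonic reweighting concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the centered-linear weight, and the linear noise-to-action interpolation path are all assumptions chosen for illustration, and the abstract does not specify any of them.

```python
import numpy as np


def softmax_weights(q, temperature=1.0):
    """Exponential (softmax) reweighting: every sample gets a positive weight."""
    z = (q - q.max()) / temperature  # subtract the max for numerical stability
    w = np.exp(z)
    return w / w.sum()


def centered_linear_weights(q):
    """Hypothetical monotonic alternative (an assumption, not from the paper).

    Samples scoring below the batch mean receive negative weight.
    """
    c = q - q.mean()
    scale = np.abs(c).sum()
    return c / scale if scale > 0 else np.zeros_like(c)


def reweighted_flow_matching_loss(velocity_fn, actions, weights, rng):
    """Per-sample flow-matching error, combined with the given weights.

    Assumes a linear path x_t = (1 - t) * noise + t * action, whose
    target velocity is (action - noise).
    """
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))
    x_t = (1.0 - t) * noise + t * actions
    target = actions - noise
    err = velocity_fn(x_t, t) - target
    per_sample = np.sum(err ** 2, axis=1)
    # A negative weight rewards a larger error on that sample, which pushes
    # the learned flow away from the corresponding action.
    return float(np.sum(weights * per_sample))
```

Calling `softmax_weights` on a batch of value estimates yields strictly positive weights, whereas `centered_linear_weights` assigns negative weights to below-average samples. Both are monotonic in the value estimate, which is the property the abstract asks of a reweighting function.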

Haitong Ma, Chenxiao Gao, Tianyi Chen, Na Li, Bo Dai • 2026

Related benchmarks

Task	Dataset	Result
Reinforcement Learning	MuJoCo Half-Cheetah	Average Return: 1.39e+4	28
Reinforcement Learning	MuJoCo Hopper	Average Return: 3.00e+3	24
Reinforcement Learning	MuJoCo Ant	Average Return: 5.98e+3	24
Reinforcement Learning	Swimmer	Average Returns: 69	24
DNA Sequence Generation	Pred-Activity	Pred-Activity: 7.62	13
Reinforcement Learning	MuJoCo Humanoid	Average Return: 5.47e+3	12
Reinforcement Learning	Gym-MuJoCo Walker2D	Average Return: 4.91e+3	10
