Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models

About

Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pretraining and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective that differs from the pretraining objective, the score/flow matching loss. In this work, we show theoretically that DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce Advantage Weighted Matching (AWM), a policy-gradient method for diffusion models. It uses the same score/flow matching loss as pretraining, which gives a lower-variance objective, and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. On the GenEval, OCR, and PickScore benchmarks, AWM delivers up to a 24× speedup over Flow-GRPO (which builds on DDPO) when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
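The idea of reweighting the pretraining loss by an advantage can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the linked repository for that); it is a toy under stated assumptions: a linear interpolation path x_t = (1 - t) * x1 + t * noise with target velocity noise - x1, advantages obtained by normalizing rewards within a group of samples, and hypothetical helper names (group_advantages, flow_matching_errors, advantage_weighted_loss, toy_velocity) chosen only for illustration.

```python
import numpy as np

def group_advantages(rewards, eps=1e-8):
    # Assumption: advantages are rewards standardized within one group of samples.
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def flow_matching_errors(velocity_fn, x1, rng):
    # Per-sample squared flow-matching error on a linear noising path (assumed).
    n = x1.shape[0]
    t = rng.uniform(size=(n, 1))            # one random time per sample
    noise = rng.standard_normal(x1.shape)   # Gaussian noise endpoint
    x_t = (1.0 - t) * x1 + t * noise        # interpolate between sample and noise
    target = noise - x1                     # velocity target for this path
    pred = velocity_fn(x_t, t)
    return np.sum((pred - target) ** 2, axis=1)

def advantage_weighted_loss(velocity_fn, x1, rewards, rng):
    # Minimizing this lowers the matching error on above-average samples
    # (positive advantage) and raises it on below-average ones.
    adv = group_advantages(rewards)
    per_sample = flow_matching_errors(velocity_fn, x1, rng)
    return float(np.mean(adv * per_sample))

def toy_velocity(x_t, t):
    # Hypothetical stand-in for a trained velocity network.
    return np.zeros_like(x_t)

rng = np.random.default_rng(0)
samples = rng.standard_normal((4, 3))       # pretend these were generated by the model
rewards = [0.1, 0.9, 0.4, 0.6]              # pretend reward-model scores
loss = advantage_weighted_loss(toy_velocity, samples, rewards, rng)
print(loss)
```

In a real setup the velocity function would be the diffusion model being fine-tuned and the loss would be differentiated with respect to its parameters; the sketch only shows how the per-sample matching error and the advantage combine.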

Shuchen Xue, Chongjian Ge, Shilong Zhang, Yichen Li, Zhi-Ming Ma • 2025

Related benchmarks

Task	Dataset	Metric	Result
Text-to-Image Generation	GenEval	Overall Score	91	914
Compositional Image Generation	GenEval	Overall Score	0.91	94
Text-to-Image Generation	DrawBench Visual Text Rendering	PickScore	22.36	17
Visual Text Rendering	OCR prompts (test)	OCR Accuracy	97	9
Compositional Image Generation	DrawBench	Aesthetics Score	5.25	9
Visual Text Rendering	DrawBench Held-out Prompts (test)	OCR Accuracy	97	5
Human Preference Alignment	DrawBench Held-out (test)	PickScore (Training Reward)	23.39	5
Text-to-Image Alignment	PickScore prompts on FLUX.1	ImageReward Score	1.6693	5
Compositional Image Generation	DrawBench Held-out (test)	GenEval	83	5

