Advantage Weighted Matching: Aligning RL with Pretraining in Diffusion Models
About
Reinforcement Learning (RL) has emerged as a central paradigm for advancing Large Language Models (LLMs), where pre-training and RL post-training share the same log-likelihood formulation. In contrast, recent RL approaches for diffusion models, most notably Denoising Diffusion Policy Optimization (DDPO), optimize an objective different from the pretraining objectives--score/flow matching loss. In this work, we establish a novel theoretical analysis: DDPO is an implicit form of score/flow matching with noisy targets, which increases variance and slows convergence. Building on this analysis, we introduce \textbf{Advantage Weighted Matching (AWM)}, a policy-gradient method for diffusion. It uses the same score/flow-matching loss as pretraining to obtain a lower-variance objective and reweights each sample by its advantage. In effect, AWM raises the influence of high-reward samples and suppresses low-reward ones while keeping the modeling objective identical to pretraining. This unifies pretraining and RL conceptually and practically, is consistent with policy-gradient theory, reduces variance, and yields faster convergence. This simple yet effective design yields substantial benefits: on GenEval, OCR, and PickScore benchmarks, AWM delivers up to a $24\times$ speedup over Flow-GRPO (which builds on DDPO), when applied to Stable Diffusion 3.5 Medium and FLUX, without compromising generation quality. Code is available at https://github.com/scxue/advantage_weighted_matching.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Text-to-Image Generation | GenEval | Overall Score91 | 391 | |
| Compositional Image Generation | GenEval | Overall Score0.91 | 44 | |
| Text-to-Image Generation | DrawBench Visual Text Rendering | PickScore22.36 | 17 | |
| Visual Text Rendering | OCR prompts (test) | OCR Accuracy97 | 9 | |
| Compositional Image Generation | DrawBench | Aesthetics Score5.25 | 9 | |
| Text-to-Image Alignment | PickScore prompts on FLUX.1 | ImageReward Score1.6693 | 5 |