
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

About

Although GRPO substantially enhances flow matching models in human preference alignment for image generation, methods such as FlowGRPO and DanceGRPO remain inefficient because they must sample and optimize over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies by integrating stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP, improving efficiency and boosting performance. Specifically, MixGRPO introduces a sliding window mechanism, applying SDE sampling and GRPO-guided optimization only within the window and ODE sampling outside it. This design confines sampling randomness to the time-steps within the window, reducing the optimization overhead and allowing more focused gradient updates that accelerate convergence. Additionally, because time-steps beyond the sliding window are not involved in optimization, higher-order solvers can be used for faster sampling. We therefore present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash reduces training time by a further 71%.
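The sliding-window idea can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; the function name, window parameters, and the simple Euler / Euler-Maruyama discretizations are all illustrative assumptions. It shows the core control flow: steps inside the window take a stochastic (SDE) update whose injected noise GRPO can optimize over, while all other steps take a deterministic (ODE) update that needs no gradient bookkeeping.

```python
import numpy as np

def mixed_ode_sde_sample(x, velocity, num_steps=25, window_start=0,
                         window_size=4, sde_noise=0.3, seed=0):
    """Toy sketch of sliding-window mixed ODE-SDE sampling (illustrative only).

    Steps with index in [window_start, window_start + window_size) use a
    stochastic Euler-Maruyama step; all other steps use a deterministic
    Euler step of the flow ODE.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / num_steps
    stochastic_steps = []  # indices a GRPO trainer would optimize over
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)  # flow-matching velocity field v(x, t)
        if window_start <= i < window_start + window_size:
            # SDE step: drift plus injected Gaussian noise (randomness
            # confined to the sliding window).
            x = x + v * dt + sde_noise * np.sqrt(dt) * rng.standard_normal(x.shape)
            stochastic_steps.append(i)
        else:
            # ODE step: plain deterministic Euler update; in MixGRPO-Flash
            # these steps could instead use a higher-order solver.
            x = x + v * dt
    return x, stochastic_steps
```

In training, the window would slide across the denoising trajectory over the course of optimization, so each group of time-steps eventually receives GRPO updates while the rest of the trajectory stays cheap and deterministic.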

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong, Liefeng Bo • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | Pick-a-Pic 1K prompts v1 | ImageReward | 1.24 | 20 |
| Text-to-Image Generation | Text-to-Image Preference Evaluation Suite (HPSv2.1, ImageReward, PickScore, Aes.Pred.v2.5, CLIP, Unified Reward) v2.1 | HPSv2.1 | 0.361 | 14 |
| Text-to-Image Generation | Out-of-Domain T2I Dataset | Laplacian Variance | 3.90e+3 | 13 |
| Human Preference Alignment | Human Preference Alignment In-Domain (test) | PickScore | 22.23 | 7 |
| Human Preference Alignment | Human Preference Alignment Out-of-Domain (test) | HPS-v2.1 | 32.4 | 7 |
| Human Preference Alignment | HPD v2 | HPS-v2.1 | 0.3521 | 5 |
| Human Preference Alignment | HPDv2 | HPS-v2.1 | 0.3649 | 5 |
