
MixGRPO: Unlocking Flow-based GRPO Efficiency with Mixed ODE-SDE

About

Although GRPO substantially enhances flow matching models in human preference alignment for image generation, methods such as FlowGRPO and DanceGRPO remain inefficient because they must sample and optimize over all denoising steps specified by the Markov Decision Process (MDP). In this paper, we propose $\textbf{MixGRPO}$, a novel framework that leverages the flexibility of mixed sampling strategies by integrating stochastic differential equations (SDE) and ordinary differential equations (ODE). This streamlines the optimization process within the MDP, improving efficiency and boosting performance. Specifically, MixGRPO introduces a sliding window mechanism, applying SDE sampling and GRPO-guided optimization only within the window and ODE sampling outside it. This design confines sampling randomness to the time-steps within the window, reducing the optimization overhead and allowing more focused gradient updates that accelerate convergence. Additionally, because time-steps beyond the sliding window are not involved in optimization, higher-order solvers can be used for faster sampling. We therefore present a faster variant, termed $\textbf{MixGRPO-Flash}$, which further improves training efficiency while achieving comparable performance. MixGRPO exhibits substantial gains across multiple dimensions of human preference alignment, outperforming DanceGRPO in both effectiveness and efficiency, with nearly 50% lower training time. Notably, MixGRPO-Flash reduces training time by a further 71%.
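The sliding-window idea can be illustrated with a minimal sketch. The snippet below is not the authors' implementation; the function name, window parameters, and the simple Euler / Euler-Maruyama discretizations are all illustrative assumptions. It shows the core control flow: steps inside the window take a stochastic (SDE) update whose injected noise GRPO can optimize over, while all other steps take a deterministic (ODE) update that needs no gradient bookkeeping.

```python
import numpy as np

def mixed_ode_sde_sample(x, velocity, num_steps=25, window_start=0,
                         window_size=4, sde_noise=0.3, seed=0):
    """Toy sketch of sliding-window mixed ODE-SDE sampling (illustrative only).

    Steps with index in [window_start, window_start + window_size) use a
    stochastic Euler-Maruyama step; all other steps use a deterministic
    Euler step of the flow ODE.
    """
    rng = np.random.default_rng(seed)
    dt = 1.0 / num_steps
    stochastic_steps = []  # indices a GRPO trainer would optimize over
    for i in range(num_steps):
        t = i * dt
        v = velocity(x, t)  # flow-matching velocity field v(x, t)
        if window_start <= i < window_start + window_size:
            # SDE step: drift plus injected Gaussian noise (randomness
            # confined to the sliding window).
            x = x + v * dt + sde_noise * np.sqrt(dt) * rng.standard_normal(x.shape)
            stochastic_steps.append(i)
        else:
            # ODE step: plain deterministic Euler update; in MixGRPO-Flash
            # these steps could instead use a higher-order solver.
            x = x + v * dt
    return x, stochastic_steps
```

In training, the window would slide across the denoising trajectory over the course of optimization, so each group of time-steps eventually receives GRPO updates while the rest of the trajectory stays cheap and deterministic.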

Junzhe Li, Yutao Cui, Tao Huang, Yinping Ma, Chun Fan, Miles Yang, Zhao Zhong, Liefeng Bo • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Image Generation | Pick-a-Pic 1K prompts v1 | ImageReward | 1.24 | 20 |
| Text-to-Image Generation | Text-to-Image Preference Evaluation Suite (HPSv2.1, ImageReward, PickScore, Aes.Pred.v2.5, CLIP, Unified Reward) v2.1 | HPSv2.1 | 0.361 | 14 |
| Text-to-Image Generation | Out-of-Domain T2I Dataset | Laplacian Variance | 3.90e+3 | 13 |
| Human Preference Alignment | Human Preference Alignment In-Domain (test) | PickScore | 22.23 | 7 |
| Human Preference Alignment | Human Preference Alignment Out-of-Domain (test) | HPS-v2.1 | 32.4 | 7 |
| Human Preference Alignment | HPD v2 | HPS-v2.1 | 0.3521 | 5 |
| Human Preference Alignment | HPDv2 | HPS-v2.1 | 0.3649 | 5 |
