Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

About

Optimizing discrete diffusion model (DDM) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, puzzling reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation for DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then delicately tailored the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. Upon math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical way for discretized visual diffusion.

Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye• 2025

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	GenEval	Overall Score80	704
Text-to-Image Generation	DPG-Bench	Overall Score82.24	451
Text-to-Image Generation	GenEval	Overall Score0.88	277
Text-to-Image Generation	GenEval	Overall Score73	218
Human preference, image quality, and aesthetics comparison	DPG-Bench	DeQA Score4.08	8
Text-to-Image Generation	Human Preference Evaluation Set	DEQA4.35	6

Showing 6 of 6 rows

Other info

Follow for update

@wizwand_team Discord