LFPO: Likelihood-Free Policy Optimization for Masked Diffusion Models

About

Reinforcement Learning with Verifiable Rewards (RLVR) has achieved remarkable success in improving autoregressive models, especially in domains that require correctness, such as mathematical reasoning and code generation. However, directly applying such paradigms to Diffusion Large Language Models (dLLMs) is fundamentally hindered by the intractability of exact likelihood computation, which forces existing methods to rely on high-variance approximations. To bridge this gap, we propose Likelihood-Free Policy Optimization (LFPO), a native framework that maps the concept of vector field flow matching to the discrete token space. Specifically, LFPO formulates alignment as geometric velocity rectification, which directly optimizes denoising logits via contrastive updates. This design bypasses the errors inherent in likelihood approximation and yields precise gradient estimation. Furthermore, LFPO enforces consistency by predicting final solutions from intermediate steps, which straightens the probability flow and enables high-quality generation with significantly fewer iterations. Extensive experiments demonstrate that LFPO not only outperforms state-of-the-art baselines on code and reasoning benchmarks but also accelerates inference by approximately 20% through reduced diffusion steps.
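The abstract does not spell out the LFPO objective, so the following is only a toy sketch of the general idea it describes: updating denoising logits contrastively, using the final sampled tokens as targets at an intermediate (partially masked) step, without estimating a sequence likelihood. It is not the authors' loss. The function name, the array shapes, and the group-mean baseline are all assumptions made for illustration. The loss is an advantage-weighted cross-entropy over masked positions: completions with above-average reward have their final tokens pushed up, and those below average are pushed down.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable log-softmax over the vocabulary axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

def contrastive_denoising_loss(logits, final_tokens, masked, rewards):
    """Toy advantage-weighted cross-entropy on masked positions (hypothetical).

    logits:       (G, L, V) denoiser outputs at an intermediate step
    final_tokens: (G, L) int, tokens of each finished completion
    masked:       (G, L) bool, positions still masked at that step
    rewards:      (G,) verifiable reward per completion
    """
    # Assumption: centre rewards within the group so updates are contrastive.
    adv = rewards - rewards.mean()
    logp = log_softmax(logits)
    # Log-probability the denoiser assigns to the final token at each position.
    tok_logp = np.take_along_axis(logp, final_tokens[..., None], axis=-1)[..., 0]
    # Average over masked positions only; guard against an empty mask.
    per_sample = (tok_logp * masked).sum(axis=1) / np.maximum(masked.sum(axis=1), 1)
    return -(adv * per_sample).mean()

# Tiny usage example with made-up shapes: 2 completions, 3 positions, 4 tokens.
demo_logits = np.zeros((2, 3, 4))
demo_final = np.array([[0, 1, 2], [3, 2, 1]])
demo_masked = np.ones((2, 3), dtype=bool)
demo_rewards = np.array([1.0, 0.0])  # first completion passed the verifier
demo_loss = contrastive_denoising_loss(demo_logits, demo_final, demo_masked, demo_rewards)
```

Minimizing this quantity raises the logits of the rewarded completion's tokens relative to the unrewarded one, which is the contrastive behaviour the abstract alludes to. The actual method may differ in every detail.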

Chenxing Wei, Jiazhen Kang, Hong Wang, Jianqing Zhang, Hao Jiang, Xiaolong Xu, Ningyuan Sun, Ying He, F. Richard Yu, Yao Shu, Bo Jiang • 2026

Related benchmarks

Task	Dataset	Result
Commonsense Reasoning	WinoGrande	Accuracy 86.9	1442
Code Generation	HumanEval	Pass@1 75.6	1043
Physical Commonsense Reasoning	PIQA	Accuracy 85.9	696
Code Generation	HumanEval+	Pass@1 70.1	393
Mathematical Reasoning	MATH	Accuracy 37.6	338
Science Reasoning	GPQA	Accuracy 27.1	243
Code Generation	MBPP+	Pass@1 71.3	238
Common Sense Reasoning	HellaSwag	Accuracy 85.7	213
Code Generation	MBPP	Pass@1 81.6	211
Code Generation	EvalPlus	Pass@1 69.5	115

Showing 10 of 12 rows
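The code-generation rows above report Pass@1. This page does not say how the metric was computed. One widely used definition is the unbiased pass@k estimator introduced with HumanEval: given n sampled solutions per problem, of which c pass the unit tests, pass@k = 1 - C(n-c, k) / C(n, k), averaged over problems. The sketch below assumes that definition. The function name and the sample counts are illustrative and are not taken from the paper.

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k for one problem: n samples drawn, c of them correct."""
    if not (0 <= c <= n and 1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        # Fewer than k incorrect samples: any k-subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def mean_pass_at_k(results, k):
    """Average pass@k over problems; results is a list of (n, c) pairs."""
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)

# Made-up counts for three problems, 10 samples each.
example_score = mean_pass_at_k([(10, 3), (10, 0), (10, 10)], k=1)
```

For k = 1 the estimator reduces to c / n per problem, so Pass@1 is simply the average fraction of samples that pass.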
