
Beyond VLM-Based Rewards: Diffusion-Native Latent Reward Modeling

About

Preference optimization for diffusion and flow-matching models relies on reward functions that are both discriminatively robust and computationally efficient. Vision-Language Models (VLMs) have emerged as the primary reward provider, leveraging their rich multimodal priors to guide alignment. However, their computational and memory costs are substantial, and optimizing a latent diffusion generator through a pixel-space reward introduces a domain mismatch that complicates alignment. In this paper, we propose DiNa-LRM, a diffusion-native latent reward model that formulates preference learning directly on noisy diffusion states. Our method introduces a noise-calibrated Thurstone likelihood with diffusion-noise-dependent uncertainty. DiNa-LRM leverages a pretrained latent diffusion backbone with a timestep-conditioned reward head, and supports inference-time noise ensembling, providing a diffusion-native mechanism for test-time scaling and robust rewarding. Across image alignment benchmarks, DiNa-LRM substantially outperforms existing diffusion-based reward baselines and achieves performance competitive with state-of-the-art VLMs at a fraction of the computational cost. When used for preference optimization, DiNa-LRM improves optimization dynamics, enabling faster and more resource-efficient model alignment.
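
The sketch below illustrates the two ingredients named in the abstract: a Thurstone (probit) preference likelihood whose uncertainty grows with the diffusion noise level, and inference-time noise ensembling over a timestep-conditioned reward head. It is a minimal illustration under assumed interfaces, not the authors' released code: the names `reward_head`, `sigma_t`, and `noise_scheduler` are hypothetical, and the scheduler is assumed to expose a diffusers-style `add_noise(original_samples, noise, timesteps)` method.

```python
# Illustrative sketch only; names and interfaces are assumptions, not the paper's API.
import torch
import torch.nn as nn


def thurstone_preference_loss(r_win: torch.Tensor,
                              r_lose: torch.Tensor,
                              sigma_t: torch.Tensor) -> torch.Tensor:
    """Noise-calibrated Thurstone (probit) likelihood.

    r_win, r_lose: rewards predicted on the noisy latents of the preferred /
        rejected samples at the same diffusion timestep.
    sigma_t: timestep-dependent uncertainty; larger noise flattens the likelihood,
        so heavily-noised comparisons contribute weaker gradients.
    """
    standard_normal = torch.distributions.Normal(0.0, 1.0)
    # P(winner preferred) under a Thurstone model with noise-dependent variance.
    p_win = standard_normal.cdf((r_win - r_lose) / (sigma_t * (2.0 ** 0.5)))
    return -torch.log(p_win.clamp_min(1e-8)).mean()


@torch.no_grad()
def ensembled_reward(reward_head: nn.Module,
                     latent: torch.Tensor,
                     noise_scheduler,
                     timesteps: list[int],
                     n_draws: int = 4) -> torch.Tensor:
    """Inference-time noise ensembling: score several independently noised
    copies of the clean latent at different timesteps and average the scores."""
    scores = []
    for t in timesteps:
        for _ in range(n_draws):
            noise = torch.randn_like(latent)
            t_batch = torch.full((latent.shape[0],), t,
                                 device=latent.device, dtype=torch.long)
            noisy_latent = noise_scheduler.add_noise(latent, noise, t_batch)
            scores.append(reward_head(noisy_latent, t_batch))
    return torch.stack(scores).mean(dim=0)
```

Averaging more noise draws trades extra forward passes for a lower-variance reward estimate, which is what the abstract refers to as a diffusion-native mechanism for test-time scaling.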

Gongye Liu, Bo Yang, Yida Zhi, Zhizhou Zhong, Lei Ke, Didan Deng, Han Gao, Yongxiang Huang, Kaihao Zhang, Hongbo Fu, Wenhan Luo • 2026

Related benchmarks

Task | Dataset | Metric | Result | Rank
Human Preference Evaluation | HPD v2 (test) | Preference Accuracy | 84.31 | 18
Human Preference Evaluation | ImageReward (test) | Preference Accuracy | 0.6175 | 18
Pairwise Preference | HPD v3 (test) | Accuracy | 75.04 | 11
Pairwise Preference | GenAI Bench (test) | Accuracy | 68.98 | 11
