# Unified Reward Model for Multimodal Understanding and Generation

## About
Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learns to assess multiple vision tasks may foster a synergistic effect: improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, providing effective reward signals for vision model preference alignment. Specifically, (1) we first train UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. (3) Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO). Experimental results show that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We further apply our pipeline to both vision understanding and generation, achieving consistent improvements across each domain.
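The two-stage filtering strategy above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward-model interfaces `rank_pair` and `score`, and the sifting threshold `min_gap`, are assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    """One (chosen, rejected) training example for DPO."""
    chosen: str
    rejected: str

def build_preference_pairs(
    candidates: List[str],
    rank_pair: Callable[[str, str], str],  # pairwise ranking: returns the preferred output
    score: Callable[[str], float],         # pointwise scoring of a single output
    min_gap: float = 0.2,                  # hypothetical sifting threshold
) -> List[PreferencePair]:
    pairs: List[PreferencePair] = []
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            a, b = candidates[i], candidates[j]
            # Stage 1: pair ranking decides chosen vs. rejected.
            chosen = rank_pair(a, b)
            rejected = b if chosen == a else a
            # Stage 2: point sifting keeps only pairs whose pointwise
            # score gap is large enough to be a reliable signal.
            if score(chosen) - score(rejected) >= min_gap:
                pairs.append(PreferencePair(chosen, rejected))
    return pairs
```

The surviving pairs would then be fed to a standard DPO trainer to align the vision model with the reward model's preferences.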
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score | 85.46 | 111 |
| Reward Modeling | VLRewardBench (test) | General | 60.6 | 24 |
| Human Preference Evaluation | HPD v2 (test) | Preference Accuracy | 83.1 | 18 |
| Human Preference Evaluation | ImageReward (test) | Preference Accuracy | 0.6382 | 18 |
| Human Preference Alignment | REACT-Video | Acc (Tie, Overall) | 41.6 | 12 |
| Pairwise Preference | GenAI Bench (test) | Accuracy | 72.38 | 11 |
| Video Preference Alignment | GenAI-Bench | Alignment Accuracy (w/ ties) | 54.8 | 11 |
| Pairwise Preference | HPD v3 (test) | Accuracy | 71.96 | 11 |
| Image Generation Assessment | GenAI-Bench Image (test) | Accuracy | 71.5 | 8 |
| Image Generation Assessment | MMRB2 (test) | Accuracy | 60 | 8 |