# Unified Reward Model for Multimodal Understanding and Generation

## About
Recent advances in human preference alignment have significantly improved multimodal generation and understanding. A key approach is to train reward models that provide supervision signals for preference optimization. However, existing reward models are often task-specific, limiting their adaptability across diverse visual applications. We also argue that a reward model that jointly learns to assess multiple vision tasks may foster a synergistic effect: improved image understanding enhances image generation assessment, and refined image evaluation benefits video assessment through better frame analysis. To this end, this paper proposes UnifiedReward, the first unified reward model for multimodal understanding and generation assessment. It supports both pairwise ranking and pointwise scoring, providing effective reward signals for vision model preference alignment. Specifically, (1) we first train UnifiedReward on our constructed large-scale human preference dataset, which covers both image and video generation/understanding tasks. (2) Then, we leverage it to automatically construct high-quality pairwise preference data from vision models by progressively filtering their outputs through our two-stage strategy, i.e., pair ranking and point sifting. (3) Finally, we use these data to align vision models with human preferences via Direct Preference Optimization (DPO). Experimental results show that jointly learning to assess diverse visual tasks yields substantial mutual benefits. We further apply our pipeline to both vision understanding and generation, achieving consistent improvements across each domain.
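The two-stage filtering strategy above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the reward-model interfaces `rank_pair` and `score`, and the sifting threshold `min_gap`, are assumptions introduced for clarity.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class PreferencePair:
    """One (chosen, rejected) training example for DPO."""
    chosen: str
    rejected: str

def build_preference_pairs(
    candidates: List[str],
    rank_pair: Callable[[str, str], str],  # pairwise ranking: returns the preferred output
    score: Callable[[str], float],         # pointwise scoring of a single output
    min_gap: float = 0.2,                  # hypothetical sifting threshold
) -> List[PreferencePair]:
    pairs: List[PreferencePair] = []
    for i in range(len(candidates)):
        for j in range(i + 1, len(candidates)):
            a, b = candidates[i], candidates[j]
            # Stage 1: pair ranking decides chosen vs. rejected.
            chosen = rank_pair(a, b)
            rejected = b if chosen == a else a
            # Stage 2: point sifting keeps only pairs whose pointwise
            # score gap is large enough to be a reliable signal.
            if score(chosen) - score(rejected) >= min_gap:
                pairs.append(PreferencePair(chosen, rejected))
    return pairs
```

The surviving pairs would then be fed to a standard DPO trainer to align the vision model with the reward model's preferences.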
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-to-Video Generation | VBench | Quality Score | 85.46 | 111 |
| Reward Modeling | VLRewardBench (test) | General | 60.6 | 24 |
| Human Preference Evaluation | HPD v2 (test) | Preference Accuracy | 83.1 | 18 |
| Human Preference Evaluation | ImageReward (test) | Preference Accuracy | 0.6382 | 18 |
| Human Preference Alignment | REACT-Video | Acc (Tie, Overall) | 41.6 | 12 |
| Pairwise Preference | GenAI Bench (test) | Accuracy | 72.38 | 11 |
| Video Preference Alignment | GenAI-Bench | Alignment Accuracy (w/ ties) | 54.8 | 11 |
| Pairwise Preference | HPD v3 (test) | Accuracy | 71.96 | 11 |
| Image Generation Assessment | GenAI-Bench Image (test) | Accuracy | 71.5 | 8 |
| Image Generation Assessment | MMRB2 (test) | Accuracy | 60 | 8 |