Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
About
We propose Skywork-VL Reward, a multimodal reward model that provides reward signals for both multimodal understanding and reasoning tasks. Our approach has two key components. First, we construct a large-scale multimodal preference dataset that covers a wide range of tasks and scenarios, with responses collected from both standard vision-language models (VLMs) and advanced VLM reasoners. Second, we design a reward model architecture based on Qwen2.5-VL-7B-Instruct, integrating a reward head and applying multi-stage fine-tuning with a pairwise ranking loss on preference pairs. Experimental evaluations show that Skywork-VL Reward achieves state-of-the-art results on the multimodal VL-RewardBench and competitive performance on the text-only RewardBench benchmark. Furthermore, preference data constructed with Skywork-VL Reward proves highly effective for training with Mixed Preference Optimization (MPO), leading to significant improvements in multimodal reasoning capabilities. Our results underscore Skywork-VL Reward as a significant advancement toward general-purpose, reliable reward models for multimodal alignment. Our model has been publicly released to promote transparency and reproducibility.
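The abstract mentions a pairwise ranking loss but does not give its exact form. As a minimal sketch, assuming the common Bradley-Terry formulation, the loss for one preference pair is $-\log \sigma(r_{\text{chosen}} - r_{\text{rejected}})$, where each $r$ is the scalar output of the reward head. The function names and the example reward values below are hypothetical and are not taken from the released model.

```python
import math


def pairwise_ranking_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Assumed Bradley-Terry form: -log(sigmoid(r_chosen - r_rejected))."""
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) equals log(1 + exp(-m)); branch on the sign of m
    # so that exp() is never called with a large positive argument.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def pairwise_accuracy(pairs) -> float:
    """Fraction of pairs in which the chosen response scores strictly higher."""
    correct = sum(1 for chosen, rejected in pairs if chosen > rejected)
    return correct / len(pairs)


# Hypothetical (chosen, rejected) scalar rewards for three preference pairs.
pairs = [(2.0, 0.5), (0.1, 0.4), (1.2, 1.0)]
mean_loss = sum(pairwise_ranking_loss(c, r) for c, r in pairs) / len(pairs)
accuracy = pairwise_accuracy(pairs)
```

The loss shrinks as the chosen response's reward exceeds the rejected one by a wider margin, and it equals $\log 2$ when the two rewards tie. Pairwise accuracy, as reported on several of the benchmarks listed here, counts how often the chosen response receives the higher score.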
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Multimodal Reward Modeling | VL-RewardBench | Accuracy | 73.1 | 102 |
| Multimodal Reward Modeling | Multimodal RewardBench | Accuracy | 91.3 | 50 |
| Multimodal Reward Modeling | RewardBench Multimodal | Safety Score | 62 | 44 |
| Multimodal Reward Modeling | RewardBench MM-RLHF | MCQ Score | 42.86 | 20 |
| Video Understanding Reward Modeling | VURB | General Video Understanding | 58.8 | 18 |
| Multimodal Reward Modeling | VideoRewardBench | Macro Pairwise Accuracy | 62.9 | 18 |
| Multimodal Reward Modeling | MR2Bench Video | Best-of-4 Accuracy | 46.7 | 18 |
| Multimodal Reward Modeling | MM-RLHF-RewardBench | Pairwise Accuracy | 72.4 | 18 |
| Multimodal Reward Modeling | MR2Bench Image | Best-of-4 Accuracy | 52.9 | 18 |
| Video Reward Modeling | VideoRewardBench | Perception (long) | 65.7 | 16 |