
Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling

About

Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism that identifies the dimensions relevant to each specific input (e.g., Hallucination, Reasoning) and adaptively weights them. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks such as VL-RewardBench. Furthermore, we show that VL-MDR-constructed preference pairs enable DPO alignment that mitigates visual hallucinations and improves reliability, providing a scalable solution for VLM alignment.
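The dynamic weighting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the dimension names are a hypothetical subset of the 21 dimensions, and the gate logits stand in for the output of the visual-aware gating network, which we do not model here. The key idea shown is that per-dimension scores are combined with input-dependent softmax weights rather than a fixed average.

```python
import math

# Illustrative subset of the 21 evaluation dimensions (names assumed).
DIMENSIONS = ["hallucination", "reasoning", "helpfulness"]

def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def aggregate_reward(dim_scores, gate_logits):
    """Combine per-dimension scores into one reward.

    dim_scores: score assigned to the response on each dimension.
    gate_logits: input-dependent relevance logits (here supplied
    directly; in VL-MDR they would come from a learned gating
    network conditioned on the image and text).
    """
    weights = softmax(gate_logits)
    return sum(w * s for w, s in zip(weights, dim_scores))

# Uniform gating reduces to a plain mean of the dimension scores:
scores = [0.9, 0.5, 0.7]
print(aggregate_reward(scores, [0.0, 0.0, 0.0]))  # 0.7 (the mean)

# A gate that strongly favors the first dimension pushes the
# reward toward that dimension's score:
print(aggregate_reward(scores, [20.0, -20.0, -20.0]))  # ~0.9
```

The interpretability claim follows directly from this structure: the softmax weights expose which dimensions drove the final reward for a given input.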

Qiyuan Chen, Hongsen Huang, Jiahe Chen, Qian Shao, Jintai Chen, Hongxia Xu, Renjie Hua, Chuan Ren, Jian Wu • 2026

Related benchmarks

Task                        Dataset                  Metric        Result   Rank
Multimodal Reward Modeling  VL-RewardBench           Accuracy      73.06    76
Hallucination Evaluation    MMHal                    Score         4.2      37
Multimodal Reward Modeling  RewardBench Multimodal   Safety Score  51.2     31
Real-world Understanding    WildVision               Win Rate      68.3     25
Image Understanding         LLaVABenchWilder         Score         77.9     8
Image Understanding         LLaVABench               Score         101.9    8
