Learning What Matters: Dynamic Dimension Selection and Aggregation for Interpretable Vision-Language Reward Modeling
About
Vision-language reward modeling faces a dilemma: generative approaches are interpretable but slow, while discriminative ones are efficient but act as opaque "black boxes." To bridge this gap, we propose VL-MDR (Vision-Language Multi-Dimensional Reward), a framework that dynamically decomposes evaluation into granular, interpretable dimensions. Instead of outputting a monolithic scalar, VL-MDR employs a visual-aware gating mechanism that identifies the relevant dimensions (e.g., Hallucination, Reasoning) and adaptively weights them for each specific input. To support this, we curate a dataset of 321k vision-language preference pairs annotated across 21 fine-grained dimensions. Extensive experiments show that VL-MDR consistently outperforms existing open-source reward models on benchmarks such as VL-RewardBench. Furthermore, we show that preference pairs constructed with VL-MDR enable effective DPO alignment, mitigating visual hallucinations and improving reliability, and thus provide a scalable solution for VLM alignment.
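To make the gating idea concrete, below is a minimal PyTorch sketch of dimension-gated reward aggregation: per-dimension heads score the input, a gating network predicts input-dependent relevance weights, and the scalar reward is their weighted sum. This is an illustration under stated assumptions, not the paper's released implementation; the class name `DimensionGatedReward`, the `hidden_dim=1024` feature size, and the softmax gate are assumptions, while the 21 dimensions come from the dataset described above.

```python
# Hypothetical sketch of visual-aware dimension gating (not the official VL-MDR code).
import torch
import torch.nn as nn


class DimensionGatedReward(nn.Module):
    """Scores each interpretable dimension, then aggregates the scores
    with input-dependent gate weights into a single scalar reward."""

    def __init__(self, hidden_dim: int = 1024, num_dims: int = 21):
        super().__init__()
        # One linear head per dimension (e.g., Hallucination, Reasoning).
        self.dim_heads = nn.Linear(hidden_dim, num_dims)
        # Gating network: predicts a relevance weight per dimension
        # from the fused vision-language features.
        self.gate = nn.Sequential(
            nn.Linear(hidden_dim, hidden_dim // 4),
            nn.GELU(),
            nn.Linear(hidden_dim // 4, num_dims),
        )

    def forward(self, fused_features: torch.Tensor):
        # fused_features: (batch, hidden_dim) pooled vision-language features.
        dim_scores = self.dim_heads(fused_features)               # (batch, num_dims)
        gate_weights = torch.softmax(self.gate(fused_features), dim=-1)
        reward = (gate_weights * dim_scores).sum(dim=-1)          # (batch,)
        # Per-dimension scores and weights are returned so the reward
        # remains inspectable rather than a monolithic scalar.
        return reward, dim_scores, gate_weights


if __name__ == "__main__":
    model = DimensionGatedReward()
    feats = torch.randn(2, 1024)  # stand-in for a VLM encoder's pooled output
    reward, scores, weights = model(feats)
    print(reward.shape, scores.shape, weights.shape)
```

Returning `dim_scores` and `gate_weights` alongside the reward is what gives the discriminative model its interpretability: one can read off which dimensions drove the score for a given input.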
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Reward Modeling | VL-RewardBench | Accuracy | 73.06 | 76 |
| Hallucination Evaluation | MMHal | Score | 4.2 | 37 |
| Multimodal Reward Modeling | RewardBench Multimodal | Safety Score | 51.2 | 31 |
| Real-world Understanding | WildVision | Win Rate | 68.3 | 25 |
| Image Understanding | LLaVABenchWilder | Score | 77.9 | 8 |
| Image Understanding | LLaVABench | Score | 101.9 | 8 |