S2H-DPO: Hardness-Aware Preference Optimization for Vision-Language Models

About

Vision-Language Models (VLMs) have made remarkable progress in single-image understanding, yet reasoning effectively across multiple images remains challenging. We identify a critical capability gap in existing multi-image alignment approaches: current methods focus primarily on localized reasoning with pre-specified image indices ("Look at Image 3 and..."), bypassing the essential skills of global visual search and autonomous cross-image comparison. To address this limitation, we introduce a Simple-to-Hard (S2H) learning framework that systematically constructs multi-image preference data across three hierarchical reasoning levels, each demanding more capability than the last: (1) single-image localized reasoning, (2) multi-image localized comparison, and (3) global visual search. Unlike prior work that relies on model-specific attributes, such as hallucinations or attention heuristics, to generate preference pairs, our approach uses prompt-driven complexity to create chosen/rejected pairs that are applicable across different models. Through extensive evaluations on LLaVA and Qwen-VL models, we show that our diverse multi-image reasoning data yields significant improvements over baseline methods across benchmarks. Importantly, our approach maintains strong single-image reasoning performance while strengthening multi-image understanding, thus advancing the state of the art for holistic visual preference alignment.
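For readers unfamiliar with preference optimization, the sketch below shows how a chosen/rejected pair tagged with one of the three reasoning levels could feed the standard DPO loss. It is a minimal illustration, not the authors' implementation: the `Level` enum, the `PreferencePair` record layout, and the `beta` default are assumptions made here, and the abstract does not say how hardness enters the objective, so the loss is the unmodified DPO formulation.

```python
import math
from dataclasses import dataclass
from enum import IntEnum


class Level(IntEnum):
    """The three reasoning levels named in the abstract, simplest first."""
    SINGLE_IMAGE_LOCALIZED = 1
    MULTI_IMAGE_LOCALIZED = 2
    GLOBAL_VISUAL_SEARCH = 3


@dataclass
class PreferencePair:
    """Hypothetical record layout for one chosen/rejected example."""
    prompt: str
    chosen: str
    rejected: str
    level: Level


def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one pair.

    Inputs are summed sequence log-probabilities under the policy
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Illustrative pair; the prompt and answers are made up for this sketch.
pair = PreferencePair(
    prompt="Which image shows a red car?",
    chosen="Image 2",
    rejected="Image 4",
    level=Level.GLOBAL_VISUAL_SEARCH,
)
```

When the policy and the reference assign identical log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference.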

Nitish Shukla, Surgan Jandial, Arun Ross • 2026

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Multi-Image Visual Reasoning	BLINK	Accuracy 55.85	51
Natural Language Visual Reasoning	NLVR2	Accuracy 74.67	41
Multi-image Reasoning	MANTIS	Accuracy 81.71	38
Single-image Reasoning	MMStar	Accuracy 62.47	17
