Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains
About
Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, we construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chain, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.
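The abstract does not specify the schema of VISC-150K records, so the sketch below is only a hypothetical illustration of what a Focus-Centric Visual Chain training sample could look like, assuming each reasoning step pairs a focus target (an image, described in words) with an intermediate inference before the final answer. All class and field names (`FocusStep`, `VisualChainSample`, `image_index`, etc.) are illustrative assumptions, not the paper's actual format.

```python
from dataclasses import dataclass, field

@dataclass
class FocusStep:
    """One link in a hypothetical Focus-Centric Visual Chain:
    attend to one image and record the inference drawn from it."""
    image_index: int   # which of the input images this step focuses on
    focus: str         # natural-language description of the attended content
    inference: str     # intermediate conclusion grounded in that focus

@dataclass
class VisualChainSample:
    """A hypothetical multi-image training record with a stepwise reasoning path."""
    images: list[str]  # paths or URLs of the input images
    question: str      # multi-image question
    chain: list[FocusStep] = field(default_factory=list)
    answer: str = ""

# Illustrative two-image example: the chain walks across images,
# focusing on one piece of evidence at a time before answering.
sample = VisualChainSample(
    images=["img_0.jpg", "img_1.jpg"],
    question="Which scene was photographed later in the day?",
    chain=[
        FocusStep(0, "shadows in the plaza", "shadows are short, so near midday"),
        FocusStep(1, "streetlights along the road", "lights are on, so evening"),
    ],
    answer="The second scene.",
)
```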
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Understanding | MVBench | Accuracy | 68 | 247 |
| Multi-image Reasoning | MIRB | Accuracy | 60.2 | 60 |
| Multi-image Understanding | MMIU | Accuracy | 52.8 | 60 |
| Multi-image Reasoning | MuirBench | Accuracy | 44.5 | 48 |
| Multi-image Reasoning | MANTIS | Accuracy | 69.1 | 18 |