
Weaving Context Across Images: Improving Vision-Language Models through Focus-Centric Visual Chains

About

Vision-language models (VLMs) achieve remarkable success in single-image tasks. However, real-world scenarios often involve intricate multi-image inputs, leading to a notable performance decline as models struggle to disentangle critical information scattered across complex visual features. In this work, we propose Focus-Centric Visual Chain, a novel paradigm that enhances VLMs' perception, comprehension, and reasoning abilities in multi-image scenarios. To facilitate this paradigm, we propose Focus-Centric Data Synthesis, a scalable bottom-up approach for synthesizing high-quality data with elaborate reasoning paths. Through this approach, we construct VISC-150K, a large-scale dataset with reasoning data in the form of Focus-Centric Visual Chain, specifically designed for multi-image tasks. Experimental results on seven multi-image benchmarks demonstrate that our method achieves average performance gains of 3.16% and 2.24% across two distinct model architectures, without compromising the general vision-language capabilities. Our study represents a significant step toward more robust and capable vision-language systems that can handle complex visual scenarios.

Juntian Zhang, Chuanqi Cheng, Yuhan Liu, Wei Liu, Jian Luan, Rui Yan • 2025

Related benchmarks

Task                       Dataset    Result (Accuracy)  Rank
Video Understanding        MVBench    68                 247
Multi-image Reasoning      MIRB       60.2               60
Multi-image Understanding  MMIU       52.8               60
Multi-image Reasoning      MuirBench  44.5               48
Multi-image Reasoning      MANTIS     69.1               18