Identifying and Mitigating Position Bias of Multi-image Vision-Language Models

About

The evolution of Large Vision-Language Models (LVLMs) has progressed from single to multi-image reasoning. Despite this advancement, our findings indicate that LVLMs struggle to robustly utilize information across multiple images, with predictions significantly affected by the alteration of image positions. To further explore this issue, we introduce Position-wise Question Answering (PQA), a meticulously designed task to quantify reasoning capabilities at each position. Our analysis reveals a pronounced position bias in LVLMs: open-source models excel in reasoning with images positioned later but underperform with those in the middle or at the beginning, while proprietary models show improved comprehension for images at the beginning and end but struggle with those in the middle. Motivated by this, we propose SoFt Attention (SoFA), a simple, training-free approach that mitigates this bias by employing linear interpolation between inter-image causal attention and bidirectional counterparts. Experimental results demonstrate that SoFA reduces position bias and enhances the reasoning performance of existing LVLMs.

Xinyu Tian, Shu Zou, Zhaoyuan Yang, Jing Zhang• 2025

Related benchmarks

Task	Dataset	Result
Video Understanding	MVBench	Accuracy57.71	563
Multi-image Reasoning	MIRB	Accuracy60.67	70
Visual Reasoning	NLVR2	Accuracy90.26	49
Multimodal Reasoning	MuirBench	Accuracy57.14	21
Multimodal Reasoning	BLINK	Accuracy55.92	20
Multimodal Reasoning	Mantis-Eval	Accuracy59.23	11
Multiple Choice Selection Accuracy	LLaVA Random N=8 OneVision (full evaluation set)	Accuracy32.16	4
Multiple Choice Selection Accuracy	LLaVA Adv N=8 OneVision (full evaluation set)	Accuracy25.34	4
Multiple Choice Selection Accuracy	LLaVA Random N=4 full OneVision (evaluation)	Accuracy59.7	4
Multiple Choice Selection Accuracy	LLaVA Adv N=4 OneVision (full evaluation set)	Accuracy43.26	4

Showing 10 of 10 rows

Other info

Code

Follow for update

@wizwand_team Discord