System-Mediated Attention Imbalances Make Vision-Language Models Say Yes

About

Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond 'yes'. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.
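
To make the idea of redistributing attention from system tokens to image and text tokens concrete, here is a minimal sketch. It is not the authors' method: the function name `redistribute_system_attention`, the `alpha` parameter, and the proportional reallocation rule are all assumptions chosen for illustration. It operates on attention weights that have already been normalised (each row sums to 1) and moves a fraction of the system-token mass onto the remaining tokens.

```python
import numpy as np


def redistribute_system_attention(attn, system_mask, alpha=0.5):
    """Move a fraction `alpha` of attention mass from system tokens to the rest.

    Hypothetical illustration only, not the procedure from the paper.
    attn:        array of shape (..., n_keys); each row sums to 1.
    system_mask: boolean array of shape (n_keys,); True marks system tokens.
    alpha:       fraction of system attention mass to move, in [0, 1].
    """
    attn = np.asarray(attn, dtype=float)
    system_mask = np.asarray(system_mask, dtype=bool)
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    # Total mass currently on system tokens and on image/text tokens, per row.
    sys_mass = attn[..., system_mask].sum(axis=-1, keepdims=True)
    other_mass = attn[..., ~system_mask].sum(axis=-1, keepdims=True)
    if np.any(other_mass <= 0):
        raise ValueError("each row needs some non-system attention mass")

    out = attn.copy()
    # Shrink system weights by the chosen fraction.
    out[..., system_mask] *= 1.0 - alpha
    # Scale image/text weights so they absorb the moved mass proportionally,
    # which keeps every row summing to 1.
    out[..., ~system_mask] *= (other_mass + alpha * sys_mass) / other_mass
    return out


# Toy example: one query over four keys, where the first key is a system token.
weights = redistribute_system_attention(
    [0.6, 0.1, 0.2, 0.1], [True, False, False, False], alpha=0.5
)
print(weights, weights.sum())
```

Scaling the non-system weights proportionally preserves their relative ordering; other reallocation rules (for example, favouring image tokens over text tokens) are equally possible and would need to be checked against the paper itself.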

Tsan Tsai Chan, Varsha Suresh, Anisha Saha, Michael Hahn, Vera Demberg • 2026

Related benchmarks

Task	Dataset	Result
Visual Question Answering	MME	--	20
Visual Question Answering	HallusionBench	Simple Accuracy: 51.31	19
Vision-Language Reasoning	Winoground	Simple Accuracy: 59.88	9
Visual Question Answering	SugarCrepe	Simple Accuracy: 68.96	9
Vision-Language Reasoning	BEAF (test)	Simple Accuracy: 88.4	7
Vision-Language Reasoning	HallusionBench (test)	Simple Accuracy: 53.31	7
Vision-Language Reasoning	NaturalBench (test)	Simple Accuracy: 66.02	7
Vision-Language Reasoning	SugarCrepe (test)	Simple Accuracy: 62.75	7
Vision-Language Reasoning	MME (test)	Simple Accuracy: 78.9	7
Paired-prompt evaluation	BEAF (sample)	Simple Accuracy: 90.67	2

Showing 10 of 16 rows
