System-Mediated Attention Imbalances Make Vision-Language Models Say Yes

About

Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond 'yes'. Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.
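
To make the idea of redistributing attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes post-softmax attention weights over a token sequence and a boolean mask marking system tokens; the function name `redistribute_system_attention` and the `keep` parameter are hypothetical, and the proportional reallocation rule is only one plausible choice.

```python
import numpy as np

def redistribute_system_attention(attn, system_mask, keep=0.5):
    """Shrink attention on system tokens and hand the freed mass to the rest.

    attn        : array (..., seq_len), each row sums to 1 (post-softmax).
    system_mask : bool array (seq_len,), True where the token is a system token.
    keep        : hypothetical knob, fraction of system attention retained.
    """
    attn = np.asarray(attn, dtype=float)
    system_mask = np.asarray(system_mask, dtype=bool)
    other_mask = ~system_mask  # image and text tokens

    sys_mass = attn[..., system_mask].sum(axis=-1, keepdims=True)
    other_mass = attn[..., other_mask].sum(axis=-1, keepdims=True)
    if np.any(other_mass <= 0):
        raise ValueError("need non-zero attention on non-system tokens")

    freed = (1.0 - keep) * sys_mass  # mass removed from system tokens
    out = attn.copy()
    out[..., system_mask] *= keep
    # Assumption: freed mass is split in proportion to existing weights,
    # so the relative ranking of image and text tokens is preserved.
    out[..., other_mask] *= (other_mass + freed) / other_mass
    return out

# Toy row: one system token followed by three image/text tokens.
row = np.array([0.6, 0.1, 0.2, 0.1])
mask = np.array([True, False, False, False])
new_row = redistribute_system_attention(row, mask, keep=0.5)
# new_row still sums to 1; the system token's share is halved.
```

Because the freed mass is distributed proportionally, each row remains a valid probability distribution. In a real model such a change would have to be applied inside the attention layers during the forward pass, which this sketch does not cover.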

Tsan Tsai Chan, Varsha Suresh, Anisha Saha, Michael Hahn, Vera Demberg • 2026

Related benchmarks

Task                       Dataset                Result                 Rank
Visual Question Answering  MME                    --                     20
Vision-Language Reasoning  Winoground             Simple Acc 59.88       9
Visual Question Answering  SugarCrepe             Simple Accuracy 68.96  9
Vision-Language Reasoning  BEAF (test)            Simple Accuracy 88.4   7
Vision-Language Reasoning  HallusionBench (test)  Simple Accuracy 53.31  7
Vision-Language Reasoning  NaturalBench (test)    Simple Accuracy 66.02  7
Vision-Language Reasoning  SugarCrepe (test)      Simple Accuracy 62.75  7
Vision-Language Reasoning  MME (test)             Simple Accuracy 78.9   7
Paired-prompt evaluation   BEAF (sample)          Simple Accuracy 90.67  2
Paired-prompt evaluation   HallusionBench         Simple Accuracy 52.89  2
Showing 10 of 16 rows