System-Mediated Attention Imbalances Make Vision-Language Models Say Yes
About
Vision-language model (VLM) hallucination is commonly linked to imbalanced allocation of attention across input modalities: system, image and text. However, existing mitigation strategies tend towards an image-centric interpretation of these imbalances, often prioritising increased image attention while giving less consideration to the roles of the other modalities. In this study, we evaluate a more holistic, system-mediated account, which attributes these imbalances to functionally redundant system weights that reduce attention to image and textual inputs. We show that this framework offers a useful empirical perspective on the yes-bias, a common form of hallucination in which VLMs indiscriminately respond "yes". Causally redistributing attention from the system modality to image and textual inputs substantially suppresses this bias, often outperforming existing approaches. We further present evidence suggesting that system-mediated attention imbalances contribute to the yes-bias by encouraging a default reliance on coarse input representations, which are effective for some tasks but ill-suited to others. Taken together, these findings firmly establish system attention as a key factor in VLM hallucination and highlight its potential as a lever for mitigation.
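To make the idea of redistributing attention concrete, the following is a minimal sketch and not the authors' implementation. It assumes a single normalised attention row over system, image and text tokens, and moves a fraction of the system-token mass onto the remaining tokens in proportion to their existing weights. The function name `redistribute_system_attention`, the parameter `alpha`, and the proportional-rescaling rule are all hypothetical choices made for illustration.

```python
def redistribute_system_attention(weights, system_idx, alpha):
    """Move a fraction `alpha` of the attention mass on system tokens
    to the non-system (image and text) tokens.

    Hypothetical illustration: the proportional-rescaling rule is an
    assumption, not the procedure described in the paper.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")

    system = set(system_idx)
    other_mass = sum(w for i, w in enumerate(weights) if i not in system)
    if other_mass <= 0.0:
        # No non-system mass to scale up, so leave the row unchanged.
        return list(weights)

    # Mass taken away from the system tokens.
    moved = alpha * sum(weights[i] for i in system)
    # Scale factor that hands the moved mass to the other tokens
    # while preserving their relative proportions.
    scale = 1.0 + moved / other_mass

    return [
        w * (1.0 - alpha) if i in system else w * scale
        for i, w in enumerate(weights)
    ]


# Example: token 0 is a system token holding 60% of the attention.
row = [0.6, 0.1, 0.1, 0.2]
new_row = redistribute_system_attention(row, system_idx=[0], alpha=0.5)
```

Because the mass removed from the system tokens is exactly the mass added to the others, the row still sums to one after the update, so it remains a valid attention distribution.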
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | MME | -- | -- | 20 |
| Vision-Language Reasoning | Winoground | Simple Accuracy | 59.88 | 9 |
| Visual Question Answering | SugarCrepe | Simple Accuracy | 68.96 | 9 |
| Vision-Language Reasoning | BEAF (test) | Simple Accuracy | 88.4 | 7 |
| Vision-Language Reasoning | HallusionBench (test) | Simple Accuracy | 53.31 | 7 |
| Vision-Language Reasoning | NaturalBench (test) | Simple Accuracy | 66.02 | 7 |
| Vision-Language Reasoning | SugarCrepe (test) | Simple Accuracy | 62.75 | 7 |
| Vision-Language Reasoning | MME (test) | Simple Accuracy | 78.9 | 7 |
| Paired-prompt evaluation | BEAF (sample) | Simple Accuracy | 90.67 | 2 |
| Paired-prompt evaluation | HallusionBench | Simple Accuracy | 52.89 | 2 |