Don't Miss the Forest for the Trees: Attentional Vision Calibration for Large Vision Language Models

About

Large Vision Language Models (LVLMs) demonstrate strong capabilities in visual understanding and description, yet often suffer from hallucinations, attributing incorrect or misleading features to images. We observe that LVLMs disproportionately focus on a small subset of image tokens, termed blind tokens, which are typically irrelevant to the query (e.g., background or non-object regions). We hypothesize that such attention misalignment plays a key role in generating hallucinated responses. To mitigate this issue, we propose Attentional Vision Calibration (AvisC), a test-time approach that dynamically recalibrates the influence of blind tokens without modifying the underlying attention mechanism. AvisC first identifies blind tokens by analyzing layer-wise attention distributions over image tokens, then employs a contrastive decoding strategy to balance the influence of original and blind-token-biased logits. Experiments on standard benchmarks, including POPE, MME, and AMBER, demonstrate that AvisC effectively reduces hallucinations in LVLMs.
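The two-step procedure described in the abstract (flag over-attended "blind" image tokens, then contrast the original logits against blind-token-biased logits) can be illustrated with a minimal NumPy sketch. The function names, the mean-based threshold rule (`tau` times the average attention), and the contrast weight `alpha` are illustrative assumptions for exposition, not the paper's exact formulation:

```python
import numpy as np

def find_blind_tokens(attn, tau=1.5):
    """Flag image tokens that receive disproportionate attention.

    attn: array of shape (num_layers, num_image_tokens) holding the
          attention mass each layer assigns to each image token.
    Returns indices of tokens whose layer-averaged attention exceeds
    tau times the overall mean (an assumed threshold rule).
    """
    per_token = attn.mean(axis=0)            # average attention per token
    threshold = tau * per_token.mean()       # global mean scaled by tau
    return np.where(per_token > threshold)[0]

def contrastive_decode(logits_orig, logits_blind, alpha=1.0):
    """Combine original and blind-token-biased logits contrastively:
    z = (1 + alpha) * z_orig - alpha * z_blind, then softmax.
    Down-weights next-token probabilities that the blind-token-biased
    forward pass inflates.
    """
    z = (1 + alpha) * logits_orig - alpha * logits_blind
    z = z - z.max()                          # numerical stability
    p = np.exp(z)
    return p / p.sum()
```

In practice the biased logits would come from a second forward pass in which only the flagged image tokens are kept (or emphasized); here both logit vectors are simply taken as inputs.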

Sangmin Woo, Donguk Kim, Jaehyuk Jang, Yubin Choi, Changick Kim• 2024

Related benchmarks

Task | Dataset | Metric | Result | Rank
Hallucination Evaluation | MMHal-Bench | MMHal Score | 2.19 | 216
Hallucination Evaluation | AMBER | CHAIR | 14.2 | 172
Object Hallucination Evaluation | MS-COCO POPE (Adversarial) | Accuracy | 87.62 | 138
Object Hallucination Evaluation | MS-COCO POPE (Popular) | Accuracy | 90.76 | 108
Object Hallucination Evaluation | MS-COCO POPE (Random) | Accuracy | 92.36 | 71
Multimodal Model Evaluation | MME | Total Score | 613.3 | 71
Medical Visual Question Answering | SLAKE (closed-end) | Accuracy | 91.27 | 54
Object Hallucination Probing | GQA POPE (Popular) | Accuracy | 74.8 | 49
Medical Visual Question Answering | VQA-RAD (closed-end) | Accuracy | 78.35 | 45
Object Hallucination Probing | GQA POPE (Adversarial) | Accuracy | 69.2 | 40

Showing 10 of 40 rows
