Mitigating Object Hallucinations in Large Vision-Language Models with Assembly of Global and Local Attention

About

Despite great success across various multimodal tasks, Large Vision-Language Models (LVLMs) often encounter object hallucinations with generated textual responses being inconsistent with the actual objects in images. We examine different LVLMs and pinpoint that one root cause of object hallucinations lies with deficient attention on discriminative image features. Specifically, LVLMs often predominantly attend to prompt-irrelevant global features instead of prompt-relevant local features, undermining their visual grounding capacity and leading to object hallucinations. We propose Assembly of Global and Local Attention (AGLA), a training-free and plug-and-play approach that mitigates hallucinations by assembling global features for response generation and local features for visual discrimination simultaneously. Specifically, we introduce an image-prompt matching scheme that captures prompt-relevant local features from images, leading to an augmented view of the input image where prompt-relevant content is highlighted while irrelevant distractions are suppressed. Hallucinations can thus be mitigated with a calibrated logit distribution that is from generative global features of the original image and discriminative local features of the augmented image. Extensive experiments show the superiority of AGLA in LVLM hallucination mitigation, demonstrating its wide applicability across both discriminative and generative tasks. Our code is available at https://github.com/Lackel/AGLA.

Wenbin An, Feng Tian, Sicong Leng, Jiahao Nie, Haonan Lin, QianYing Wang, Ping Chen, Xiaoqin Zhang, Shijian Lu• 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	Accuracy85.5	2019
Multimodal Evaluation	MME	--	727
Hallucination Evaluation	CHAIR	CHAIR_s54.8	393
Multimodal Capability Evaluation	MM-Vet	Score46.88	393
Object Hallucination	POPE Popular	F1 Score87.75	372
Object Hallucination	POPE Adversarial	Accuracy86.87	353
Object Hallucination	POPE (Random)	F1 Score88.97	324
Hallucination Evaluation	MMHal-Bench	MMHal Score2.14	306
Hallucination Evaluation	POPE	--	217
Object Hallucination Evaluation	MS-COCO (POPE Adversarial)	Accuracy83.13	190

Showing 10 of 34 rows

Other info

Follow for update

@wizwand_team Discord