
Seeing is Believing: Mitigating Hallucination in Large Vision-Language Models via CLIP-Guided Decoding

About

Large Vision-Language Models (LVLMs) are susceptible to object hallucination, an issue in which their generated text describes objects that are not present in the image, greatly limiting their reliability and practicality. Current approaches often rely on the model's token likelihoods or other internal information, on instruction tuning with additional datasets, or on complex external tools. We first perform an empirical analysis of sentence-level LVLM hallucination, finding that CLIP similarity to the image is a stronger and more robust indicator of hallucination than token likelihoods. Motivated by this, we introduce CLIP-Guided Decoding (CGD), a straightforward but effective training-free method that reduces object hallucination at decoding time. CGD uses CLIP to guide the model's decoding process, enhancing the visual grounding of generated text in the image. Experiments demonstrate that CGD effectively mitigates object hallucination across multiple LVLM families while preserving the utility of text generation. Code is available at https://github.com/d-ailin/CLIP-Guided-Decoding.
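The core idea — rerank candidate continuations by combining the LVLM's own likelihood with CLIP image-text similarity — can be sketched as below. This is a minimal illustration, not the paper's exact scoring rule: `sample_sentences` and `clip_similarity` are placeholder callables standing in for the LVLM sampler and a real CLIP model, and the additive combination with weight `alpha` is an assumption for exposition.

```python
from typing import Callable, List, Tuple


def clip_guided_step(
    sample_sentences: Callable[[str, int], List[Tuple[str, float]]],
    clip_similarity: Callable[[str], float],
    prefix: str,
    num_candidates: int = 4,
    alpha: float = 1.0,
) -> str:
    """One sentence-level decoding step.

    Samples candidate next sentences from the LVLM, scores each by
    LM log-probability plus alpha times the CLIP image-text similarity
    of the resulting text, and keeps the best-grounded candidate.
    """
    # Each candidate is (sentence_text, lm_logprob).
    candidates = sample_sentences(prefix, num_candidates)
    scored = [
        (text, logprob + alpha * clip_similarity(prefix + text))
        for text, logprob in candidates
    ]
    # Pick the candidate with the highest combined score.
    best_text, _ = max(scored, key=lambda pair: pair[1])
    return prefix + best_text
```

With `alpha = 0` this reduces to likelihood-only selection; increasing `alpha` trades LM fluency against visual grounding, which is the lever the method uses to suppress hallucinated objects.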

Ailin Deng, Zhirui Chen, Bryan Hooi • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy: 49 | 1525 |
| Object Hallucination Evaluation | POPE | -- | 1455 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 30.8 | 531 |
| Hallucination Evaluation | CHAIR | CHAIR_s: 46.2 | 252 |
| Mathematical Reasoning | MathVista mini | Accuracy: 53.18 | 102 |
| Multimodal Evaluation | SEED-Bench | Accuracy: 63.23 | 95 |
| Multi-modal Understanding | LLaVA-Bench Wild | LLaVA^W Score: 71.3 | 86 |
| Object Hallucination in Open-ended Captioning | Chair (test) | -- | 50 |
| Visual Question Answering | ScienceQA (SQA) | SQA Accuracy: 64 | 43 |
| Multi-modal Understanding | MMBench | Mean Accuracy: 64.5 | 32 |
(Showing 10 of 15 rows.)
