VL-Uncertainty: Detecting Hallucination in Large Vision-Language Model via Uncertainty Estimation

About

Given the higher information load processed by large vision-language models (LVLMs) compared to single-modal LLMs, detecting LVLM hallucinations requires more human effort and time, and thus raises wider safety concerns. In this paper, we introduce VL-Uncertainty, the first uncertainty-based framework for detecting hallucinations in LVLMs. Unlike most existing methods, which require ground-truth or pseudo annotations, VL-Uncertainty uses uncertainty as an intrinsic metric. We measure uncertainty by analyzing the prediction variance across semantically equivalent but perturbed prompts, covering both visual and textual inputs. When LVLMs are highly confident, they give consistent responses to semantically equivalent queries. When they are uncertain, their responses become more random. Because semantically similar answers can be worded differently, we cluster LVLM responses by their semantic content and then compute the entropy of the cluster distribution as the uncertainty measure for detecting hallucination. Our extensive experiments on 10 LVLMs across four benchmarks, covering both free-form and multi-choice tasks, show that VL-Uncertainty significantly outperforms strong baseline methods in hallucination detection.
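The clustering-and-entropy step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names are hypothetical, and the `normalized_match` equivalence check is a simple string-matching stand-in for whatever semantic-equivalence judgment the paper uses. The responses are assumed to have already been collected from the LVLM under perturbed prompts.

```python
import math
from typing import Callable, List


def normalized_match(a: str, b: str) -> bool:
    """Stand-in equivalence check (assumption): case- and
    whitespace-insensitive string equality, not a semantic model."""
    return " ".join(a.lower().split()) == " ".join(b.lower().split())


def cluster_responses(
    responses: List[str],
    same_meaning: Callable[[str, str], bool] = normalized_match,
) -> List[List[str]]:
    """Greedily group responses: each response joins the first cluster
    whose representative (first member) it matches, else starts a new one."""
    clusters: List[List[str]] = []
    for response in responses:
        for cluster in clusters:
            if same_meaning(response, cluster[0]):
                cluster.append(response)
                break
        else:
            clusters.append([response])
    return clusters


def cluster_entropy(
    responses: List[str],
    same_meaning: Callable[[str, str], bool] = normalized_match,
) -> float:
    """Shannon entropy (in nats) of the distribution of responses over clusters."""
    if not responses:
        raise ValueError("at least one response is required")
    clusters = cluster_responses(responses, same_meaning)
    total = len(responses)
    probabilities = [len(cluster) / total for cluster in clusters]
    return sum(p * math.log(1.0 / p) for p in probabilities)


# Responses to perturbed versions of the same question (made-up examples).
consistent = ["A cat", "a cat", "a  cat "]
scattered = ["a cat", "a dog", "a cat", "a bird"]

low_uncertainty = cluster_entropy(consistent)   # all answers fall in one cluster
high_uncertainty = cluster_entropy(scattered)   # answers spread over three clusters
```

Under this sketch, consistent answers collapse into a single cluster and yield zero entropy, while scattered answers yield a higher value. How the score is thresholded to flag a hallucination is not stated in the abstract, so it is left out here.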

Ruiyang Zhang, Hu Zhang, Zhedong Zheng • 2024

Related benchmarks

Task	Dataset	Result
Uncertainty Estimation	AOKVQA	AUC 78.1	65
Uncertainty Quantification	OKVQA	AUROC 73.1	62
Binary safety hallucination detection	EndoVis18-VQA Out-of-template (val)	Accuracy 85	50
Multiple choice question visual question answering	Multiple choice question (MCQ) visual question answering (VQA) benchmarks Average	AUC 80.2	36
Self-evaluation	ViLP	AUROC 67.4	36
Self-evaluation	MMVet	AUROC 0.859	36
Self-evaluation	VisualCoT	AUROC 77.7	36
Self-evaluation	CVBench	AUROC 0.72	36
Open-ended multimodal understanding	Open-ended multimodal understanding benchmarks Average	AUC 0.654	28
Error detection	Unanswerability hallucination detection benchmarks (Average)	AUC 56.1	28

Showing 10 of 15 rows
