MaskCD: Mitigating LVLM Hallucinations by Image Head Masked Contrastive Decoding

About

Large vision-language models (LVLMs) have shown remarkable performance in visual-language understanding for downstream multimodal tasks. While their capabilities are improving, problems have emerged alongside them. Among these, hallucinations have attracted much attention: the phenomenon where LVLMs generate content that contradicts their visual and textual inputs. Many approaches have been proposed to deal with this issue, such as contrastive decoding and attention manipulation. However, contrastive decoding methods struggle to construct appropriate contrastive samples, and attention manipulation methods are highly sensitive and lack stability. In this work, we propose image head Masked Contrastive Decoding (MaskCD). Our approach utilizes the "image heads" in LVLMs, masking them to construct contrastive samples for contrastive decoding. We evaluated MaskCD on LLaVA-1.5-7b and Qwen-VL-7b, using various benchmarks such as CHAIR, POPE, AMBER and MME. The results demonstrate that MaskCD effectively alleviates hallucinations while retaining the general capabilities of LVLMs. Corresponding resources can be found at: https://github.com/Deng-Jingyuan/MaskCD.

Jingyuan Deng, Yujiu Yang • 2025
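
To make the idea of contrastive decoding in the abstract concrete, here is a minimal sketch of a single decoding step. It uses the contrastive form common in earlier contrastive decoding work, with a plausibility cutoff. This is not necessarily the exact formulation used by MaskCD, and the abstract does not specify one. The function names, the weights `alpha` and `beta`, and the input `logits_masked` (assumed to come from a second forward pass with the image heads masked) are all hypothetical and chosen for illustration only.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def contrastive_decode_step(logits_full, logits_masked, alpha=1.0, beta=0.1):
    """Choose the next token by contrasting two next-token distributions.

    logits_full:   logits from the unmodified model.
    logits_masked: logits from the contrastive sample (assumption: the same
                   model with its image heads masked).
    alpha:         contrast strength (hypothetical default).
    beta:          plausibility cutoff relative to the most likely token
                   under the full model (hypothetical default).
    Returns (token_id, scores).
    """
    logp_full = log_softmax(logits_full)
    logp_masked = log_softmax(logits_masked)

    # Only tokens the full model already finds plausible may be selected.
    cutoff = math.log(beta) + max(logp_full)

    scores = []
    for lf, lm in zip(logp_full, logp_masked):
        if lf >= cutoff:
            # Boost tokens whose probability drops when image heads are masked.
            scores.append((1 + alpha) * lf - alpha * lm)
        else:
            scores.append(float("-inf"))

    token_id = max(range(len(scores)), key=scores.__getitem__)
    return token_id, scores


if __name__ == "__main__":
    full = [2.0, 1.9, -3.0]    # toy logits from the full model
    masked = [2.0, 0.5, 4.0]   # toy logits with image heads masked
    token_id, scores = contrastive_decode_step(full, masked)
    print(token_id, scores)
```

In this toy example, the token whose score falls most sharply under masking is favored, while a token that only the masked pass prefers is excluded by the plausibility cutoff.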

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | MS-COCO (POPE Adversarial) | Accuracy 86.05 | 190 |
| Object Hallucination Evaluation | MS-COCO POPE (Popular) | Accuracy 88.65 | 158 |
| Object Hallucination Evaluation | MS-COCO POPE Random | Accuracy 90.05 | 121 |
| Object Hallucination Evaluation | A-OKVQA POPE Popular | Accuracy 89.05 | 76 |
| Object Hallucination Evaluation | POPE GQA Popular | Accuracy 86.35 | 70 |
| Object Hallucination Evaluation | A-OKVQA POPE Random | Accuracy 90.55 | 60 |
| Caption Hallucination Evaluation | CHAIR | -- | 44 |
| Object Hallucination Assessment | A-OKVQA POPE (Adversarial) | Accuracy 0.8275 | 42 |
| Object Hallucination Evaluation | POPE-GQA Adversarial | Accuracy 83.25 | 34 |
| Multi-modal Hallucination Evaluation | AMBER | CHAIR 6.6 | 28 |