Mask What Matters: Mitigating Object Hallucinations in Multimodal Large Language Models with Object-Aligned Visual Contrastive Decoding
About
We study object hallucination in Multimodal Large Language Models (MLLMs) and improve visual contrastive decoding (VCD) by constructing an object-aligned auxiliary view. Leveraging the object-centric attention of self-supervised Vision Transformers, we remove the most salient visual evidence to build an auxiliary view that disrupts unsupported tokens and yields a stronger contrast signal. Our method is prompt-agnostic and model-agnostic, and plugs seamlessly into the existing VCD pipeline with little computational overhead, i.e., a single cacheable forward pass. Empirically, it achieves consistent gains on two popular object hallucination benchmarks across two MLLMs.
Boqi Chen, Xudong Liu, Jianing Qiu• 2026
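The core contrastive step can be sketched as below. This is a minimal illustration, not the paper's exact implementation: `alpha` (contrast strength), the logit-combination form, and the `salient_mask` helper (which drops the most attended patches to form the auxiliary view) are all assumptions for exposition.

```python
import numpy as np

def vcd_logits(logits_orig, logits_aux, alpha=1.0):
    """Contrastive decoding: boost tokens supported by the full image,
    suppress tokens the masked auxiliary view still predicts.
    `alpha` is an assumed contrast-strength hyperparameter."""
    return (1 + alpha) * logits_orig - alpha * logits_aux

def salient_mask(attn_map, keep_ratio=0.5):
    """Hypothetical helper: keep only the *least* salient patches
    (attention below the `keep_ratio` quantile), removing the most
    salient visual evidence to build the object-aligned view."""
    thresh = np.quantile(attn_map, keep_ratio)
    return (attn_map <= thresh).astype(attn_map.dtype)

# Toy example: three vocabulary tokens.
logits_orig = np.array([2.0, 1.0, 0.5])  # full-image forward pass
logits_aux = np.array([0.5, 1.2, 0.4])   # masked-view forward pass
print(vcd_logits(logits_orig, logits_aux, alpha=1.0))  # → [3.5 0.8 0.6]
```

Because the auxiliary view is fixed per image, its forward pass can be computed once and cached across decoding steps, which is where the low overhead comes from.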
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination | POPE (Random) | F1 Score | 86.5 | 200 |
| Object Hallucination | POPE Adversarial | Accuracy | 82.9 | 196 |
| Object Hallucination | POPE Popular | F1 Score | 84.3 | 188 |
| Hallucination Evaluation | POPE Random v1.0 (test) | Accuracy | 89.5 | 31 |
| Hallucination Evaluation | POPE Popular v1.0 (test) | Accuracy | 85.7 | 31 |
| Hallucination Evaluation | POPE Adversarial v1.0 (test) | Accuracy | 81.9 | 31 |