Breaking the Illusion: When Positive Meets Negative in Multimodal Decoding

About

Vision-Language Models (VLMs) are frequently undermined by object hallucination, generating content that contradicts visual reality, due to an over-reliance on linguistic priors. We introduce Positive-and-Negative Decoding (PND), a training-free inference framework that intervenes directly in the decoding process to enforce visual fidelity. PND is motivated by our finding of an attention imbalance in VLMs, where visual features are under-weighted. Our framework introduces a dual-path contrast: a positive path that amplifies visual evidence and a negative path that constructs counterfactuals to penalize prior-dominant generation. By contrasting outputs from both paths during decoding, PND steers generation toward visually grounded results. Experiments on POPE, MME, and CHAIR demonstrate state-of-the-art performance without retraining.
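To make the dual-path contrast concrete, here is a minimal sketch of one decoding step. It is not the paper's formula: the abstract does not give the combination rule, so the code assumes the generic contrastive-decoding form `(1 + alpha) * log p_pos - alpha * log p_neg` with a plausibility cutoff. The names `contrast_step`, `alpha`, and `beta` are hypothetical, and the logits for both paths are taken as given, since the abstract does not specify how each path is built.

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]

def contrast_step(pos_logits, neg_logits, alpha=1.0, beta=0.1):
    """Choose the next token by contrasting two paths (illustrative only).

    pos_logits: next-token logits from the visually amplified path.
    neg_logits: next-token logits from the counterfactual, prior-dominant path.
    alpha, beta: assumed hyperparameters, not values from the paper.
    """
    pos = log_softmax(pos_logits)
    neg = log_softmax(neg_logits)
    # Plausibility cutoff: only keep tokens whose positive-path probability
    # is at least beta times that of the most likely positive-path token.
    cutoff = max(pos) + math.log(beta)
    scores = [
        (1 + alpha) * p - alpha * n if p >= cutoff else float("-inf")
        for p, n in zip(pos, neg)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Toy vocabulary: 0 = "dog", 1 = "cat", 2 = "frisbee".
pos_logits = [2.0, 0.5, 1.8]  # with visual evidence, "frisbee" is nearly tied with "dog"
neg_logits = [2.5, 0.5, 0.2]  # the language prior alone strongly prefers "dog"
token, _ = contrast_step(pos_logits, neg_logits)
```

In this toy case greedy decoding on the positive path alone would pick "dog", while the contrast picks "frisbee", because "dog" is also what the prior-dominant path predicts and is penalized for it. The cutoff keeps the subtraction from promoting tokens that the positive path itself considers implausible.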

Yubo Jiang, Yitong An, Xin Yang, Abudukelimu Wuerkaixi, Xuxin Cheng, Fengying Xie, Zhiguo Jiang, Cao Liu, Ke Zeng, Haopeng Zhang • 2026

Related benchmarks

Task	Dataset	Metric	Result
Object Hallucination Evaluation	MS-COCO (POPE Adversarial)	Accuracy	87.26	205
Object Hallucination Evaluation	MS-COCO POPE (Popular)	Accuracy	86.1	158
Object Hallucination Evaluation	MS-COCO POPE Random	Accuracy	87.63	121
Image Captioning	CHAIR	CHAIR_S	46	71
Multimodal Perception	MME	Total Score	668.3	42
