SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding
About
Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information, SECOND significantly reduces perceptual hallucinations and improves performance across a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale processing in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
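The contrast-across-scales idea can be illustrated with a minimal sketch. The snippet below is a hypothetical, simplified contrastive-decoding step in the common log-space formulation (amplify tokens supported by the fine-scale view, penalize those favored only by the coarse view, with an adaptive plausibility cutoff); the `alpha` and `beta` parameters and the two-scale setup are illustrative assumptions, not SECOND's exact iterative scale-selection procedure, which is detailed in the paper.

```python
import numpy as np

def contrastive_decode(logits_fine, logits_coarse, alpha=1.0, beta=0.1):
    """Sketch of one contrastive decoding step across two visual scales.

    logits_fine   : next-token logits conditioned on the fine-scale view
    logits_coarse : next-token logits conditioned on the coarse-scale view
    alpha, beta   : illustrative hyperparameters (not from the paper)
    """
    # Adaptive plausibility constraint: keep only tokens whose fine-scale
    # probability is within a factor beta of the most likely token.
    p_fine = np.exp(logits_fine - logits_fine.max())
    p_fine /= p_fine.sum()
    mask = p_fine >= beta * p_fine.max()

    # Contrast on the log scale: (1 + alpha) * fine - alpha * coarse,
    # boosting evidence the fine view adds over the coarse view.
    scores = (1 + alpha) * logits_fine - alpha * logits_coarse
    scores[~mask] = -np.inf  # rule out implausible tokens
    return int(np.argmax(scores))
```

In this toy formulation, a token that the coarse view favors but the fine view does not (a typical hallucination pattern) is suppressed, while the plausibility mask prevents the contrast term from promoting tokens the fine view itself considers unlikely.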
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | MS-COCO POPE (Popular) | Accuracy | 89.7 | 108 |
| Object Hallucination Evaluation | A-OKVQA POPE (Popular) | Accuracy | 90.3 | 52 |
| Object Hallucination Evaluation | GQA POPE (Popular) | Accuracy | 89.4 | 46 |
| Vision-Language Perception and Reasoning | MMStar | Accuracy | 39.9 | 16 |
| Vision-Language Perception and Reasoning | MMBench lite | Accuracy | 84.8 | 16 |
| Visual Question Answering | VQA lite v2 | Accuracy | 77.5 | 16 |