
SECOND: Mitigating Perceptual Hallucination in Vision-Language Models via Selective and Contrastive Decoding

About

Despite significant advancements in Vision-Language Models (VLMs), the performance of existing VLMs remains hindered by object hallucination, a critical challenge to achieving accurate visual understanding. To address this issue, we propose SECOND: Selective and Contrastive Decoding, a novel approach that enables VLMs to effectively leverage multi-scale visual information in an object-centric manner, closely aligning with human visual perception. SECOND progressively selects and integrates multi-scale visual information, facilitating a more precise interpretation of images. By iteratively contrasting this visual information across scales, SECOND significantly reduces perceptual hallucinations and outperforms existing methods on a wide range of benchmarks. Our theoretical analysis and experiments highlight the largely unexplored potential of multi-scale processing in VLMs, showing that prioritizing and contrasting across scales outperforms existing methods.
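The core idea of contrasting visual information across scales can be sketched in a few lines. The following is a minimal illustrative sketch, not the authors' implementation: it assumes the model produces next-token logits conditioned on each visual scale (ordered coarse to fine), and it amplifies what each finer scale adds relative to the coarser one before normalizing, in the spirit of contrastive decoding. The function name, the `alpha` weight, and the input layout are hypothetical.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over the last axis.
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def contrastive_multiscale_decode(logits_by_scale, alpha=1.0):
    """Hypothetical sketch of scale-contrastive decoding.

    logits_by_scale: list of next-token logit vectors, ordered from the
    coarsest to the finest visual scale. Each step adds alpha times the
    difference between a finer scale and the preceding coarser one, so
    tokens supported by fine-grained evidence are boosted and tokens that
    only the coarse view favors (a common hallucination source) are
    suppressed. Returns a probability distribution over the vocabulary.
    """
    combined = np.asarray(logits_by_scale[0], dtype=float)
    for coarse, fine in zip(logits_by_scale, logits_by_scale[1:]):
        combined = combined + alpha * (np.asarray(fine, dtype=float)
                                       - np.asarray(coarse, dtype=float))
    return softmax(combined)

# Toy example: the coarse scale prefers token 0, but the finer scale's
# evidence shifts the contrasted distribution toward token 1.
probs = contrastive_multiscale_decode([[2.0, 1.0], [1.0, 3.0]])
```

With `alpha=0` this reduces to decoding from the coarsest scale alone; larger `alpha` trusts fine-scale evidence more aggressively.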

Woohyeon Park, Woojin Kim, Jaeik Kim, Jaeyoung Do • 2025

Related benchmarks

Task                                       Dataset                  Result          Rank
Object Hallucination Evaluation            MS-COCO POPE (Popular)   Accuracy 89.7   108
Object Hallucination Evaluation            A-OKVQA POPE (Popular)   Accuracy 90.3   52
Object Hallucination Evaluation            GQA POPE (Popular)       Accuracy 89.4   46
Vision-Language Perception and Reasoning   MMStar                   Accuracy 39.9   16
Vision-Language Perception and Reasoning   MMBench lite             Accuracy 84.8   16
Visual Question Answering                  VQA lite v2              Accuracy 77.5   16
