CODE: Contrasting Self-generated Description to Combat Hallucination in Large Multi-modal Models

About

Large Multi-modal Models (LMMs) have recently demonstrated remarkable abilities in visual context understanding and coherent response generation. However, alongside these advancements, the issue of hallucinations has emerged as a significant challenge, producing erroneous responses that are unrelated to the visual contents. In this paper, we introduce a novel contrastive-based decoding method, COuntering DEscription Contrastive Decoding (CODE), which leverages self-generated descriptions as contrasting references during the decoding phase of LMMs to address hallucination issues. CODE utilizes the comprehensive descriptions from the model itself as a visual counterpart to correct and improve response alignment with actual visual content. By dynamically adjusting the information flow and distribution of next-token predictions in the LMM's vocabulary, CODE enhances the coherence and informativeness of generated responses. Extensive experiments demonstrate that our method significantly reduces hallucinations and improves cross-modal consistency across various benchmarks and cutting-edge LMMs. Our method provides a simple yet effective decoding strategy that can be integrated into existing LMM frameworks without additional training.
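The general idea of contrasting two next-token distributions can be sketched as follows. This is a minimal illustration of generic contrastive decoding, not the paper's exact formulation: the function name `contrastive_next_token`, the fixed contrast weight `alpha`, and the fixed plausibility threshold `beta` are all assumptions made for the example, and the dynamic adjustment described in the abstract is not reproduced. The sketch takes logits from a pass conditioned on the image and from a pass conditioned on the model's own description, then favors tokens that the image-conditioned pass supports more strongly.

```python
import math


def log_softmax(logits):
    """Convert raw logits to log-probabilities in a numerically stable way."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def contrastive_next_token(logits_image, logits_description, alpha=1.0, beta=0.1):
    """Pick the next token index by contrasting two next-token distributions.

    logits_image:       logits from the pass conditioned on the actual image.
    logits_description: logits from the pass conditioned on the model's own
                        self-generated description instead of the image.
    alpha, beta:        hypothetical fixed hyperparameters (assumptions for
                        this sketch; the paper adjusts the contrast dynamically).
    """
    logp_img = log_softmax(logits_image)
    logp_desc = log_softmax(logits_description)

    # Plausibility cutoff: only consider tokens whose image-conditioned
    # probability is at least beta times that of the most likely token.
    cutoff = max(logp_img) + math.log(beta)

    best_token, best_score = None, -math.inf
    for token, (lp_img, lp_desc) in enumerate(zip(logp_img, logp_desc)):
        if lp_img < cutoff:
            continue
        # Amplify the image-conditioned evidence and subtract what the
        # description-conditioned pass would have predicted anyway.
        score = (1 + alpha) * lp_img - alpha * lp_desc
        if score > best_score:
            best_token, best_score = token, score
    return best_token


if __name__ == "__main__":
    # Toy three-token vocabulary: token 0 is favored by both passes,
    # token 1 is supported mainly by the image-conditioned pass.
    print(contrastive_next_token([2.0, 1.9, -3.0], [2.0, 0.5, -3.0]))
```

With `alpha=0.0` the function reduces to ordinary greedy decoding on the image-conditioned logits, which makes the effect of the contrast term easy to isolate when experimenting.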

Junho Kim, Hyunjun Kim, Yeonju Kim, Yong Man Ro • 2024

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Hallucination Evaluation	CHAIR	CHAIR_s 45.2	393
Visual Question Answering	VQA v2	Accuracy 66	347
Hallucination Evaluation	MMHal-Bench	MMHal Score 2.05	309
Object Hallucination Evaluation	CHAIR	CHAIRi Score 13.8	174
Multi-modal Question Answering	MMMU	Accuracy 52.41	98
Object Hallucination Evaluation	MSCOCO 2014 (val)	CHAIRs 50.76	81
Object Hallucination	POPE	F1 Score 85.06	79
Object Probing	POPE (average)	Accuracy 87.31	52
Hallucination assessment	AMBER (test)	CHAIR 9	44

Showing 10 of 14 rows
