Mitigating Hallucinations via Inter-Layer Consistency Aggregation in Large Vision-Language Models

About

Despite the impressive capabilities of Large Vision-Language Models (LVLMs), they remain susceptible to hallucinations, where generated content is inconsistent with the input image. Existing training-free hallucination mitigation methods often suffer from unstable performance and high sensitivity to hyperparameter settings, which limits their practicality and broader adoption. In this paper, we propose Decoding with Inter-layer Consistency via Layer Aggregation (DCLA), a training-free decoding mechanism that requires no retraining, fine-tuning, or access to external knowledge bases. Specifically, DCLA constructs a dynamic semantic reference by aggregating representations from previous layers and uses it to correct semantically deviated layers, thereby enforcing inter-layer consistency. Experiments across seven LVLMs and multiple benchmarks demonstrate the generality of DCLA: it surpasses standard decoding by 28.58 MME points on LLaVA1.5-7B and 42.6 MME points on Qwen2.5-VL, while improving POPE accuracy by 2.74 percentage points in the strongest setting.
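The abstract describes the mechanism only at a high level, so the following is a minimal sketch of the general idea, not the paper's algorithm. It assumes that the "dynamic semantic reference" is a running mean of earlier layers' hidden states, that "semantic deviation" is measured by cosine similarity against that reference, and that correction is a linear blend toward the reference. The function names and the `threshold` and `blend` parameters are hypothetical and used purely for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def aggregate_and_correct(layers, threshold=0.5, blend=0.5):
    """Illustrative inter-layer consistency pass (assumed design, not DCLA itself).

    layers: list of per-layer hidden-state vectors (lists of floats).
    threshold, blend: hypothetical knobs for this sketch only.
    """
    corrected = [list(layers[0])]      # the first layer has no earlier reference
    running_sum = list(layers[0])      # element-wise sum of corrected layers so far
    for h in layers[1:]:
        n = len(corrected)
        # Assumed aggregation: mean of all previous (already corrected) layers.
        reference = [s / n for s in running_sum]
        if cosine(h, reference) < threshold:
            # Assumed correction: pull the deviating layer toward the reference.
            h = [(1 - blend) * x + blend * r for x, r in zip(h, reference)]
        else:
            h = list(h)
        corrected.append(h)
        running_sum = [s + x for s, x in zip(running_sum, h)]
    return corrected


# Toy example: the third "layer" points away from the first two and gets corrected.
toy_layers = [[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]]
result = aggregate_and_correct(toy_layers)
```

In this toy run the second vector stays as it is because it is close to the reference, while the third is blended toward the mean of the first two. A real decoder would operate on the model's per-layer hidden states for the current token, and the paper may aggregate and correct them quite differently.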

Kai Tang, Jinhao You, Yichen Guo, Yiding Sun, Dongxu Zhang, Wenya Wang, Hanze Li, Tao Luo, Renyuan Li, Xiande Huang, Shanghang Zhang • 2025

Related benchmarks

Task	Dataset	Result
Object Hallucination Evaluation	POPE	--	2056
Multimodal Evaluation	MME	--	902
Multimodal Model Evaluation	MMBench	--	265
Multimodal Reasoning	MMBench	--	180
Multimodal Evaluation	MMStar	--	177
Object Hallucination Evaluation	CHAIR	--	174
Multimodal Evaluation	MMBench	MMB^CN Score: 85.74	146
Visual Question Answering	VizWiz (test)	--	136
Caption Hallucination Evaluation	CHAIR	CS Score: 37.4	122
Object Hallucination Evaluation	POPE GQA	Accuracy: 91	86

Showing 10 of 32 rows
