Self-Correction Inside the Model: Leveraging Layer Attention to Mitigate Hallucinations in Large Vision Language Models
About
Although Large Vision-Language Models (LVLMs) have made substantial progress, hallucination, where generated text is not grounded in the visual input, remains a challenge. As LVLMs become stronger, previously reported hallucination patterns, such as linguistic bias and the overthinking phenomenon, become far less consistent, making the corresponding mitigation techniques substantially less effective. In this paper, we introduce an Internal self-Correction mechanism utilizing Layer Attention (ICLA) that operates directly on hidden states during generation. Each layer selectively retrieves information from all preceding layers through a diagonal cross-layer attention mechanism, enabling self-refinement without any external correction signals. By introducing and training only 0.2M and 0.1M additional parameters on LLaVA1.5-7B and Qwen2.5-VL-7B, respectively, ICLA consistently improves visual grounding across multiple hallucination benchmarks, demonstrating its effectiveness for more advanced LVLMs.
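The diagonal cross-layer attention described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes "diagonal" means each token position attends only to the same position across preceding layers, and the function and parameter names (`cross_layer_attention`, `Wq`, `Wk`) are hypothetical. The only added parameters are the two small projections, consistent with the abstract's claim of a sub-1M parameter budget.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_layer_attention(hidden_states, Wq, Wk, layer_idx):
    """Let layer `layer_idx` retrieve information from all preceding
    layers (hypothetical sketch of the mechanism, not the paper's code).

    hidden_states: (num_layers, seq_len, d) stack of per-layer states.
    Wq, Wk: small learned projections of shape (d, d_k); these are the
            only additional trainable parameters in this sketch.
    """
    h = hidden_states[layer_idx]              # (seq, d) current layer
    prev = hidden_states[:layer_idx + 1]      # (l+1, seq, d) layers 0..l
    q = h @ Wq                                # (seq, d_k) queries
    k = prev @ Wk                             # (l+1, seq, d_k) keys
    # "Diagonal": token t scores only position t of each earlier layer,
    # so attention runs across the layer axis, not the sequence axis.
    scores = np.einsum('td,ltd->tl', q, k) / np.sqrt(Wq.shape[1])
    w = softmax(scores)                       # (seq, l+1) layer weights
    mixed = np.einsum('tl,ltd->td', w, prev)  # weighted layer mixture
    return h + mixed                          # residual self-refinement
```

In a real model this would be applied to the transformer's cached hidden states at each decoding step; here random arrays stand in for them.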
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MMMU | Accuracy | 58.3 | 437 |
| Multimodal Reasoning | MMMU | Accuracy | 69.2 | 130 |
| Multimodal Understanding | MMMU (test) | -- | -- | 112 |
| Object Hallucination Evaluation | POPE A-OKVQA | Accuracy | 89.03 | 75 |
| Object Hallucination Evaluation | POPE MSCOCO | Accuracy | 89.93 | 55 |
| Multimodal Perception | MME | Perception Score | 1740 | 43 |
| Multimodal Evaluation | LLaVA-Bench | -- | -- | 38 |
| Perception | MME | Total Perception Score | 1710 | 15 |
| Vision-Language Conversation | LLaVA-Bench | Overall Accuracy | 106.8 | 7 |