
Reallocating Attention Across Layers to Reduce Multimodal Hallucination

About

Multimodal large reasoning models (MLRMs) often suffer from hallucinations that stem not only from insufficient visual grounding but also from an imbalanced allocation between perception and reasoning processes. Building on recent interpretability findings that suggest a staged division of attention across layers, we analyze how this functional misalignment leads to two complementary failure modes: perceptual bias in shallow layers and reasoning drift in deeper layers. To alleviate these issues, we propose Functional Head Identification and Class-Conditioned Rescaling, a lightweight, training-free plugin that identifies perception- and reasoning-oriented heads and adaptively rebalances their layerwise contributions. Our method improves reasoning consistency and visual faithfulness without retraining or any architectural modification. Evaluations across three representative MLRMs and five multimodal reasoning benchmarks show an average gain of 4.2 percentage points, with less than 1% additional computation and only 9% added latency over the baseline. Beyond empirical improvements, our study provides an interpretable perspective on regulating cross-layer functional dynamics to enhance the reliability of multimodal reasoning.
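The core idea of the plugin can be sketched as follows: once heads have been labeled as perception- or reasoning-oriented, each head's output is multiplied by a class-conditioned scale factor before the outputs are merged. This is a minimal illustrative sketch, not the paper's implementation; the function name, the label scheme, and the scale values are hypothetical assumptions.

```python
import numpy as np

def rescale_head_outputs(head_outputs, head_classes,
                         alpha_perception=1.2, alpha_reasoning=0.9):
    """Rescale each attention head's output by a class-conditioned factor.

    head_outputs: array of shape (num_heads, seq_len, dim), one slice per head.
    head_classes: list of "perception" / "reasoning" labels, one per head
                  (hypothetical labeling; the paper identifies heads via
                  interpretability analysis).
    Returns the rescaled per-head outputs; real scale values would be
    chosen adaptively per layer, not fixed constants as here.
    """
    scales = np.array([alpha_perception if c == "perception" else alpha_reasoning
                       for c in head_classes])
    # Broadcast the per-head scale over the sequence and feature dimensions.
    return head_outputs * scales[:, None, None]

# Toy usage: four heads, two perception and two reasoning.
outs = np.ones((4, 2, 3))
classes = ["perception", "perception", "reasoning", "reasoning"]
scaled = rescale_head_outputs(outs, classes)
```

Because the rescaling is a simple elementwise multiply on already-computed head outputs, it adds negligible compute, which is consistent with the sub-1% overhead the abstract reports.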

Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista mini | Accuracy | 69.78 | 102 |
| Multimodal Evaluation | SEED-Bench | Accuracy | 69.74 | 95 |
| Mathematics Reasoning | MathVision Mini | Accuracy | 60.54 | 15 |
| Multimodal Integration | MMStar | Accuracy | 66.49 | 15 |
| Visual Reasoning | HallusionBench | Accuracy | 68.19 | 15 |
