
Reallocating Attention Across Layers to Reduce Multimodal Hallucination

About

Multimodal large reasoning models (MLRMs) often suffer from hallucinations that stem not only from insufficient visual grounding but also from an imbalanced allocation between perception and reasoning processes. Building on recent interpretability findings that suggest a staged division of attention across layers, we analyze how this functional misalignment leads to two complementary failure modes: perceptual bias in shallow layers and reasoning drift in deeper layers. To alleviate these issues, we propose Functional Head Identification and Class-Conditioned Rescaling, a lightweight, training-free plugin that identifies perception- and reasoning-oriented heads and adaptively rebalances their layerwise contributions. Our method improves reasoning consistency and visual faithfulness without retraining or any architectural modification. Evaluations across three representative MLRMs and five multimodal reasoning benchmarks show an average gain of 4.2 percentage points, with less than 1% additional computation and only 9% added latency over the baseline. Beyond empirical improvements, our study provides an interpretable perspective on regulating cross-layer functional dynamics to enhance the reliability of multimodal reasoning.
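The core idea of the plugin can be sketched as follows: once heads have been labeled as perception- or reasoning-oriented, each head's output is multiplied by a class-conditioned scale factor before the outputs are merged. This is a minimal illustrative sketch, not the paper's implementation; the function name, the label scheme, and the scale values are hypothetical assumptions.

```python
import numpy as np

def rescale_head_outputs(head_outputs, head_classes,
                         alpha_perception=1.2, alpha_reasoning=0.9):
    """Rescale each attention head's output by a class-conditioned factor.

    head_outputs: array of shape (num_heads, seq_len, dim), one slice per head.
    head_classes: list of "perception" / "reasoning" labels, one per head
                  (hypothetical labeling; the paper identifies heads via
                  interpretability analysis).
    Returns the rescaled per-head outputs; real scale values would be
    chosen adaptively per layer, not fixed constants as here.
    """
    scales = np.array([alpha_perception if c == "perception" else alpha_reasoning
                       for c in head_classes])
    # Broadcast the per-head scale over the sequence and feature dimensions.
    return head_outputs * scales[:, None, None]

# Toy usage: four heads, two perception and two reasoning.
outs = np.ones((4, 2, 3))
classes = ["perception", "perception", "reasoning", "reasoning"]
scaled = rescale_head_outputs(outs, classes)
```

Because the rescaling is a simple elementwise multiply on already-computed head outputs, it adds negligible compute, which is consistent with the sub-1% overhead the abstract reports.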

Haolang Lu, Bolun Chu, WeiYe Fu, Guoshun Nan, Junning Liu, Minghui Pan, Qiankun Li, Yi Yu, Hua Wang, Kun Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Mathematical Reasoning | MathVista mini | Accuracy | 69.78 | 102 |
| Multimodal Evaluation | SEED-Bench | Accuracy | 69.74 | 95 |
| Mathematics Reasoning | MathVision Mini | Accuracy | 60.54 | 15 |
| Multimodal Integration | MMStar | Accuracy | 66.49 | 15 |
| Visual Reasoning | HallusionBench | Accuracy | 68.19 | 15 |
