From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception

About

While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.
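For readers unfamiliar with CVAEs, the standard conditional evidence lower bound is sketched below. This is the generic textbook objective, not necessarily the exact loss used by VIF. The notation is an assumption made for illustration: x stands for the image and question, y for the answer, and z for the latent variable that would encode question-relevant visual saliency.

```latex
\log p_\theta(y \mid x) \;\ge\;
  \mathbb{E}_{q_\phi(z \mid x, y)}\big[\log p_\theta(y \mid x, z)\big]
  \;-\; \mathrm{KL}\big(q_\phi(z \mid x, y) \,\|\, p_\theta(z \mid x)\big)
```

Under this reading, the posterior q_\phi sees the question-answer pair during training, while the prior p_\theta(z | x) must infer the latent from the input alone at inference time; the KL term pulls the two together.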

Jilong Zhu, Yang Feng • 2026

Related benchmarks

Task                                 Dataset             Result               Rank
Referring Expression Comprehension   RefCOCO (testA)     --                   342
Science Question Answering           ScienceQA (SQA)     Accuracy 73.5        273
Visual Question Answering            AI2D                Accuracy 64          249
Referring Expression Comprehension   RefCOCO (testB)     --                   205
Visual Question Answering            GQA                 Mean Accuracy 62.8   196
Multimodal Understanding             SEED-Bench Image    Accuracy 70.8        121
Visual Question Answering            VQA v2              Accuracy 79.7        101
Multimodal Understanding             MME                 Score 1.55e+3        83
Multimodal Understanding             LLaVA-Bench         Overall Score 66.4   72
OCR Visual Question Answering        TextVQA             Accuracy 59.9        45
Showing 10 of 14 rows
