From Attenuation to Attention: Variational Information Flow Manipulation for Fine-Grained Visual Perception
About
While Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities in general visual understanding, they frequently falter in fine-grained perception tasks that require identifying tiny objects or discerning subtle visual relationships. We attribute this limitation to Visual Attenuation: a phenomenon where sparse fine-grained visual signals are prematurely suppressed or diluted by dominant textual tokens during network propagation, resulting in a "loss of focus" during the deep-level decision-making process. Existing input-centric solutions fail to fundamentally reverse this intrinsic mechanism of information loss. To address this challenge, we propose the Variational Information Flow (VIF) framework. Adopting a probabilistic perspective, VIF leverages a Conditional Variational Autoencoder (CVAE) to model the visual saliency relevant to the question-answer pair as a latent distribution. As a plug-and-play module, VIF can be integrated into existing architectures. Extensive evaluations across diverse benchmarks, covering General VQA, fine-grained perception, and visual grounding, demonstrate that VIF yields competitive improvements over previous methods, validating its effectiveness in enhancing the fine-grained perception of MLLMs.
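The abstract says a CVAE models question-answer-relevant visual saliency as a latent distribution. As a minimal sketch of the two generic building blocks any CVAE relies on, the snippet below shows the reparameterized sample and the KL divergence between a diagonal-Gaussian posterior and a conditional prior. This is not the authors' implementation: the function names, the diagonal-Gaussian assumption, and the toy numbers are all illustrative assumptions, and nothing here shows how VIF maps the latent onto visual tokens.

```python
import math
import random

def reparameterize(mu, logvar, rng):
    """Draw z = mu + sigma * eps with eps ~ N(0, I) (reparameterization trick)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, logvar_q, mu_p, logvar_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total

# Hypothetical toy values: a 2-dimensional latent.
rng = random.Random(0)
z = reparameterize([0.0, 1.0], [0.0, -2.0], rng)

# Posterior mean shifted by 1 in the first dimension, unit variances everywhere.
kl = gaussian_kl([1.0, 0.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0])
print(len(z), kl)
```

In a typical CVAE, the posterior is conditioned on both the input and the target (here, presumably the image and the question-answer pair), the prior is conditioned only on what is available at inference time, and the KL term pulls the two together during training.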
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Referring Expression Comprehension | RefCOCO (testA) | -- | 342 |
| Science Question Answering | ScienceQA (SQA) | Accuracy: 73.5 | 273 |
| Visual Question Answering | AI2D | Accuracy: 64 | 249 |
| Referring Expression Comprehension | RefCOCO (testB) | -- | 205 |
| Visual Question Answering | GQA | Mean Accuracy: 62.8 | 196 |
| Multimodal Understanding | SEED-Bench Image | Accuracy: 70.8 | 121 |
| Visual Question Answering | VQA v2 | Accuracy: 79.7 | 101 |
| Multimodal Understanding | MME | Score: 1.55e+3 | 83 |
| Multimodal Understanding | LLaVA-Bench | Overall Score: 66.4 | 72 |
| OCR Visual Question Answering | TextVQA | Accuracy: 59.9 | 45 |