iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding
About
Despite the success of Large Vision-Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose features are reused unchanged across different textual tasks. This rigidity hinders fine-grained reasoning, where task-specific visual cues are critical. To address this, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves the task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe that quantifies logical consistency under multi-query, multi-instruction settings. Extensive experiments show that iGVLM consistently improves instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm that bridges passive perception and active reasoning.
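
To make the conditioning branch concrete, below is a minimal PyTorch sketch of AdaLN-style affine modulation as described above: a pooled instruction embedding predicts a per-channel scale and shift applied to normalized visual tokens from the frozen branch. The class name `AdaLNConditioner`, the dimensions, and the zero-initialization scheme are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class AdaLNConditioner(nn.Module):
    """Instruction-conditioned affine modulation of frozen vision features.

    Sketch of the AdaLN idea: an instruction embedding predicts a
    per-channel scale (gamma) and shift (beta) that modulate normalized
    visual tokens. All names and sizes here are illustrative.
    """

    def __init__(self, vis_dim: int, txt_dim: int):
        super().__init__()
        # LayerNorm without learned affine params; AdaLN supplies them instead.
        self.norm = nn.LayerNorm(vis_dim, elementwise_affine=False)
        # Maps a pooled instruction embedding to (gamma, beta).
        self.to_scale_shift = nn.Sequential(
            nn.SiLU(),
            nn.Linear(txt_dim, 2 * vis_dim),
        )
        # Zero-init so modulation starts as identity: (1 + 0) * x + 0,
        # which keeps the pre-trained visual priors intact at the start
        # of training (assumption consistent with the "smooth transition"
        # claim above).
        nn.init.zeros_(self.to_scale_shift[-1].weight)
        nn.init.zeros_(self.to_scale_shift[-1].bias)

    def forward(self, vis_tokens: torch.Tensor, instr_emb: torch.Tensor) -> torch.Tensor:
        # vis_tokens: (B, N, vis_dim) outputs of the frozen vision encoder
        # instr_emb:  (B, txt_dim) pooled instruction embedding
        gamma, beta = self.to_scale_shift(instr_emb).chunk(2, dim=-1)
        return (1 + gamma.unsqueeze(1)) * self.norm(vis_tokens) + beta.unsqueeze(1)


# Usage: modulate frozen visual tokens with an instruction embedding.
if __name__ == "__main__":
    mod = AdaLNConditioner(vis_dim=1024, txt_dim=4096)
    vis = torch.randn(2, 576, 1024)   # e.g. ViT patch tokens
    txt = torch.randn(2, 4096)        # e.g. pooled LLM instruction state
    out = mod(vis, txt)
    print(out.shape)  # torch.Size([2, 576, 1024])
```

Because the frozen branch is untouched and the modulation is initialized to the identity, such a module can in principle be dropped into an existing LVLM without disturbing its pre-trained behavior, which is what makes the design plug-and-play.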
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VizWiz | Accuracy | 55.3 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.1 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 80.2 | 1362 |
| Science Question Answering | ScienceQA (SQA) | Accuracy | 73 | 273 |
| Multi-instruction Visual Reasoning | MM4 | Score | 164 | 44 |
| Multimodal Understanding | MMStar | Average Score | 36.4 | 31 |
| Visual Question Answering | GQA | Accuracy | 63.3 | 30 |