
iGVLM: Dynamic Instruction-Guided Vision Encoding for Question-Aware Multimodal Understanding

About

Despite the success of Large Vision–Language Models (LVLMs), most existing architectures suffer from a representation bottleneck: they rely on static, instruction-agnostic vision encoders whose visual features are reused unchanged across different textual tasks. This rigidity hinders fine-grained reasoning, where task-specific visual cues are critical. To address this, we propose iGVLM, a general framework for instruction-guided visual modulation. iGVLM introduces a decoupled dual-branch architecture: a frozen representation branch that preserves the task-agnostic visual representations learned during pre-training, and a dynamic conditioning branch that performs affine feature modulation via Adaptive Layer Normalization (AdaLN). This design enables a smooth transition from general-purpose perception to instruction-aware reasoning while maintaining the structural integrity and stability of pre-trained visual priors. Beyond standard benchmarks, we introduce MM4, a controlled diagnostic probe that quantifies logical consistency in multi-query, multi-instruction settings. Extensive experiments show that iGVLM consistently improves instruction sensitivity across diverse language backbones, offering a plug-and-play paradigm for bridging passive perception and active reasoning.
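The core mechanism the abstract describes, conditioning frozen visual features on the instruction via AdaLN-style affine modulation, can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the toy dimensions, the two-layer conditioning MLP, and the `(1 + gamma)` residual-style scaling are all assumptions for clarity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each feature vector to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def adaln_modulate(visual_tokens, instruction_emb, w1, w2):
    """Instruction-conditioned affine modulation (AdaLN sketch):
    predict per-channel (gamma, beta) from the instruction embedding,
    then scale and shift the normalized frozen visual features."""
    h = np.tanh(instruction_emb @ w1)   # conditioning hidden state (assumed MLP)
    gamma_beta = h @ w2                 # predict 2*d modulation parameters
    d = visual_tokens.shape[-1]
    gamma, beta = gamma_beta[:d], gamma_beta[d:]
    # Frozen-branch features stay fixed; only the affine params depend
    # on the instruction, so different instructions yield different views.
    return (1.0 + gamma) * layer_norm(visual_tokens) + beta

rng = np.random.default_rng(0)
tokens = rng.normal(size=(16, 8))   # 16 visual tokens, dim 8 (toy sizes)
instr = rng.normal(size=(4,))       # toy instruction embedding
w1 = rng.normal(size=(4, 6)) * 0.1
w2 = rng.normal(size=(6, 16)) * 0.1  # outputs 2 * d = 16 values

out = adaln_modulate(tokens, instr, w1, w2)
print(out.shape)  # (16, 8)
```

Because the visual tokens pass through unchanged until the affine step, the same frozen encoder output can serve arbitrarily many instructions, which is what makes the design plug-and-play.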

Hanpeng Liu, Yaqian Li, Zidan Wang, Shuoxi Zhang, Zihao Bo, Rinyoichi Takezoe, Kaiwen Long, Kun He • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VizWiz | Accuracy 55.3 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy 86.1 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy 80.2 | 1362 |
| Science Question Answering | ScienceQA (SQA) | Accuracy 73 | 273 |
| Multi-instruction Visual Reasoning | MM4 | Score 164 | 44 |
| Multimodal Understanding | MMStar | Average Score 36.4 | 31 |
| Visual Question Answering | GQA | Accuracy 63.3 | 30 |
