Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models

About

Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features extracted from the final layers of the vision encoder, overlooking the complementary information available in shallower layers. While recent approaches have explored the use of multi-layer visual features in LVLMs, they tend to be task-agnostic and fail to examine the dependencies of hierarchical visual features on specific tasks. To address these gaps, we systematically investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multi-layer features provide complementary strengths with varying task dependencies, and that uniform fusion leads to suboptimal performance. Building on these insights, we propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations demonstrate the superior performance of our method. Additionally, an in-depth analysis of the aggregator's behavior highlights the dominance of mid-to-high-level features in semantic-rich tasks and the critical role of low-level features in fine-grained perception.

Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, Xiangyang Xue • 2024
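
The abstract describes fusing multi-layer visual features under the guidance of a textual instruction without adding visual tokens. The sketch below illustrates one way that idea could work: an instruction embedding is mapped to one softmax weight per encoder layer, and the layers are combined by a weighted sum, so the output has the same number of tokens as a single layer. This is not the authors' implementation. The function names, the linear projection `proj`, the random toy data, and all sizes are assumptions made for illustration only.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def layer_weights(instruction_vec, proj):
    """Map an instruction embedding to one weight per encoder layer.

    proj is a hypothetical learned matrix with one row per layer.
    """
    scores = [sum(w * x for w, x in zip(row, instruction_vec)) for row in proj]
    return softmax(scores)


def fuse_layers(layer_feats, weights):
    """Weighted sum over layers; the output keeps the per-layer token count.

    layer_feats[l][t][d] is feature d of token t at layer l.
    """
    num_tokens = len(layer_feats[0])
    dim = len(layer_feats[0][0])
    fused = [[0.0] * dim for _ in range(num_tokens)]
    for feats, w in zip(layer_feats, weights):
        for t in range(num_tokens):
            for d in range(dim):
                fused[t][d] += w * feats[t][d]
    return fused


# Toy setup: all sizes and values are illustrative, not taken from the paper.
random.seed(0)
NUM_LAYERS, NUM_TOKENS, FEAT_DIM, TEXT_DIM = 4, 6, 8, 5
layer_feats = [[[random.gauss(0, 1) for _ in range(FEAT_DIM)]
                for _ in range(NUM_TOKENS)] for _ in range(NUM_LAYERS)]
proj = [[random.gauss(0, 1) for _ in range(TEXT_DIM)] for _ in range(NUM_LAYERS)]
instruction_vec = [random.gauss(0, 1) for _ in range(TEXT_DIM)]

weights = layer_weights(instruction_vec, proj)
fused = fuse_layers(layer_feats, weights)
print(len(fused), len(fused[0]))  # token count and feature width of one layer
```

Because the layers are summed rather than concatenated along the token axis, the fused output costs the language model no more visual tokens than a single-layer baseline. A different instruction embedding yields different layer weights, which is the task-dependent behaviour the abstract contrasts with uniform fusion.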

Related benchmarks

Task	Dataset	Metric	Result
Visual Question Answering	AI2D	Accuracy	57	402
Science Question Answering	ScienceQA (SQA)	Accuracy	70.2	338
Visual Question Answering	GQA	Mean Accuracy	63.1	196
Multimodal Understanding	MME	Score	1.52e+3	150
Multimodal Understanding	SEED-Bench Image	Accuracy	68.3	143
OCR Visual Question Answering	TextVQA	Accuracy	59.4	88
High-Resolution Visual Perception	HR-Bench-4K	Accuracy	40	79
Fine-grained Visual Perception	V-Star	Accuracy	48.2	20
