
Instruction-Guided Fusion of Multi-Layer Visual Features in Large Vision-Language Models

About

Large Vision-Language Models (LVLMs) have achieved remarkable success in a wide range of multimodal tasks by integrating pre-trained vision encoders and large language models. However, current LVLMs primarily rely on visual features extracted from the final layers of the vision encoder, overlooking the complementary information available in shallower layers. While recent approaches have explored the use of multilayer visual features in LVLMs, they tend to be task-agnostic and fail to examine the dependencies of hierarchical visual features on specific tasks. To address these gaps, we systematically investigate the contributions of visual features from different encoder layers using 18 benchmarks spanning 6 task categories. Our findings reveal that multilayer features provide complementary strengths with varying task dependencies, and uniform fusion leads to suboptimal performance. Building on these insights, we propose the instruction-guided vision aggregator, a module that dynamically integrates multi-layer visual features based on textual instructions, without increasing the number of visual tokens. Extensive evaluations demonstrate the superior performance of our method. Additionally, an in-depth analysis of the aggregator's behavior highlights the dominance of mid-to-high-level features in semantic-rich tasks and the critical role of low-level features in fine-grained perception.
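The core idea of the aggregator, weighting layer-wise visual features by the instruction and summing them so the number of visual tokens stays fixed, can be illustrated with a minimal sketch. This is not the authors' implementation: the scoring rule (a dot product between an instruction embedding and one hypothetical learned key per layer group), the names `aggregate` and `layer_keys`, and all numeric values are assumptions made for illustration only.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def aggregate(layer_features, layer_keys, instruction_embedding):
    """Fuse per-layer visual features with instruction-dependent weights.

    layer_features[k][t] is the feature vector of visual token t taken from
    layer group k. layer_keys[k] is a hypothetical per-group key vector
    (an assumption of this sketch, not something stated in the abstract).
    Returns (fused, weights); fused has the same token count as each input.
    """
    scores = [dot(instruction_embedding, key) for key in layer_keys]
    weights = softmax(scores)
    num_tokens = len(layer_features[0])
    dim = len(layer_features[0][0])
    fused = [
        [
            sum(weights[k] * layer_features[k][t][d] for k in range(len(weights)))
            for d in range(dim)
        ]
        for t in range(num_tokens)
    ]
    return fused, weights


# Toy input: 3 layer groups (low, mid, high), 2 visual tokens, 2 dimensions.
layer_features = [
    [[1.0, 0.0], [0.0, 1.0]],  # low-level group
    [[2.0, 2.0], [1.0, 1.0]],  # mid-level group
    [[0.0, 3.0], [3.0, 0.0]],  # high-level group
]
layer_keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
instruction_embedding = [2.0, 0.5]  # made-up vector that favors the first key

fused, weights = aggregate(layer_features, layer_keys, instruction_embedding)
```

Because the fusion is a weighted sum across layer groups rather than a concatenation along the token axis, `fused` has exactly as many tokens as any single layer's output, which matches the abstract's claim that the number of visual tokens does not grow.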

Xu Li, Yi Zheng, Haotian Chen, Xiaolei Chen, Yuxuan Liang, Chenghang Lai, Bin Li, Xiangyang Xue • 2024

Related benchmarks

Task                               Dataset            Result               Rank
Science Question Answering         ScienceQA (SQA)    Accuracy 70.2        273
Visual Question Answering          AI2D               Accuracy 57          249
Visual Question Answering          GQA                Mean Accuracy 63.1   196
Multimodal Understanding           SEED-Bench Image   Accuracy 68.3        121
Multimodal Understanding           MME                Score 1.52e+3        83
OCR Visual Question Answering      TextVQA            Accuracy 59.4        45
High-Resolution Visual Perception  HR-Bench-4K        Accuracy 40          40
Fine-grained Visual Perception     V-Star             Accuracy 48.2        20
