
MMFuser: Multimodal Multi-Layer Feature Fuser for Fine-Grained Vision-Language Understanding

About

Despite significant advancements in Multimodal Large Language Models (MLLMs) for understanding complex human intentions through cross-modal interactions, capturing intricate image details remains challenging. Previous methods that integrate multiple vision encoders to enhance visual detail introduce redundancy and computational overhead. We observe that most MLLMs utilize only the last-layer feature map of the vision encoder for visual representation, neglecting the rich fine-grained information in shallow feature maps. To address this issue, we propose MMFuser, a simple yet effective multi-layer feature fuser that efficiently integrates deep and shallow features from Vision Transformers (ViTs). Specifically, it leverages semantically aligned deep features as queries to dynamically extract missing details from shallow features, thus preserving semantic alignment while enriching the representation with fine-grained information. Applied to the LLaVA-1.5 model, MMFuser achieves significant improvements in visual representation and benchmark performance, providing a more flexible and lightweight solution than multi-encoder ensemble methods. The code and model have been released at https://github.com/yuecao0119/MMFuser.
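The core idea described above — using the semantically aligned last-layer features as queries that attend over shallow-layer features to pull in missing detail — can be illustrated with a minimal cross-attention sketch. This is a conceptual NumPy illustration only, not the released implementation: the actual model presumably uses learned projection weights, multi-head attention, and normalization, and the function name `mmfuser_sketch` is our own.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mmfuser_sketch(deep, shallow_layers):
    """Fuse shallow ViT features into deep features via cross-attention.

    deep:           (N, D) last-layer patch tokens, used as queries
    shallow_layers: list of (N, D) shallow-layer token maps, used as keys/values
    Returns a (N, D) representation: deep features enriched with shallow detail.
    """
    # Stack all shallow layers into one pool of key/value tokens: (L*N, D).
    shallow = np.concatenate(shallow_layers, axis=0)
    d = deep.shape[-1]
    # Deep features query the shallow pool (scaled dot-product attention).
    attn = softmax(deep @ shallow.T / np.sqrt(d), axis=-1)   # (N, L*N)
    details = attn @ shallow                                  # (N, D)
    # Residual add: keep the semantically aligned deep features,
    # augmented with fine-grained information extracted from shallow layers.
    return deep + details
```

The residual formulation reflects the paper's stated goal: semantic alignment comes from the deep features, while the attention output only adds the fine-grained detail that the last layer lacks.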

Yue Cao, Yangzhou Liu, Zhe Chen, Guangchen Shi, Wenhai Wang, Danhuai Zhao, Tong Lu • 2024

Related benchmarks

| Task                               | Dataset         | Metric   | Result  | Rank |
|------------------------------------|-----------------|----------|---------|------|
| Visual Question Answering          | VizWiz          | Accuracy | 57.4    | 1525 |
| Object Hallucination Evaluation    | POPE            | Accuracy | 87.5    | 1455 |
| Visual Question Answering          | TextVQA         | Accuracy | 59.9    | 1285 |
| Visual Question Answering          | GQA             | Accuracy | 63.4    | 1249 |
| Multimodal Evaluation              | MME             | Score    | 1.59e+3 | 658  |
| Visual Question Answering          | ScienceQA       | Accuracy | 68.7    | 370  |
| Referring Expression Comprehension | RefCOCO+ (val)  | --       | --      | 354  |
| Multimodal Capability Evaluation   | MM-Vet          | Score    | 36.6    | 345  |
| Referring Expression Comprehension | RefCOCO (val)   | --       | --      | 344  |
| Referring Expression Comprehension | RefCOCO (testA) | --       | --      | 342  |

(Showing 10 of 38 rows)
