
Dense Connector for MLLMs

About

Do we fully leverage the potential of the visual encoder in Multimodal Large Language Models (MLLMs)? The recent outstanding performance of MLLMs in multimodal understanding has garnered broad attention from both academia and industry. In the current MLLM rat race, the focus seems to be predominantly on the linguistic side: we witness the rise of larger and higher-quality instruction datasets, as well as the involvement of larger-sized LLMs. Yet scant attention has been directed towards the visual signals utilized by MLLMs, which are often assumed to be the final high-level features extracted by a frozen visual encoder. In this paper, we introduce the Dense Connector, a simple, effective, and plug-and-play vision-language connector that significantly enhances existing MLLMs by leveraging multi-layer visual features, with minimal additional computational overhead. Building on this, we also propose the Efficient Dense Connector, which achieves performance comparable to LLaVA-v1.5 with only 25% of the visual tokens. Furthermore, our model, trained solely on images, showcases remarkable zero-shot capabilities in video understanding as well. Experimental results across various vision encoders, image resolutions, training dataset scales, varying sizes of LLMs (2.7B to 70B), and diverse architectures of MLLMs (e.g., LLaVA-v1.5, LLaVA-NeXT, and Mini-Gemini) validate the versatility and scalability of our approach, achieving state-of-the-art performance across 19 image and video benchmarks. We hope that this work will provide valuable experience and serve as a basic module for future MLLM development. Code is available at https://github.com/HJYao00/DenseConnector.
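The core idea described above is to feed the LLM features drawn from several layers of the visual encoder rather than only the final layer. A minimal sketch of one plausible variant is channel-wise concatenation: features from a few chosen encoder layers are concatenated along the channel dimension and projected into the LLM's embedding space. All shapes and the single linear projection below are illustrative assumptions, not the paper's exact architecture (which uses an MLP projector and several integration variants).

```python
import numpy as np

def dense_connector(layer_feats, w_proj, b_proj):
    """Channel-concatenation sketch of a multi-layer vision-language connector.

    layer_feats: list of (num_tokens, dim) arrays, one per selected encoder layer.
    Returns (num_tokens, llm_dim) visual tokens for the LLM.
    """
    # Fuse multi-layer features along the channel axis...
    fused = np.concatenate(layer_feats, axis=-1)   # (num_tokens, dim * n_layers)
    # ...then map them into the LLM embedding space with a learned projection
    # (a single linear layer here stands in for the usual MLP projector).
    return fused @ w_proj + b_proj                 # (num_tokens, llm_dim)

# Toy shapes (assumed): 3 selected layers, 576 visual tokens (24x24 patch grid),
# 1024-dim encoder features, 4096-dim LLM embeddings.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((576, 1024)) for _ in range(3)]
w = rng.standard_normal((3 * 1024, 4096)) * 0.01
b = np.zeros(4096)
tokens = dense_connector(feats, w, b)
print(tokens.shape)  # (576, 4096)

# The "Efficient" variant's 25%-token budget can be approximated by 2x2
# average pooling over the patch grid before projection: 576 -> 144 tokens.
grid = tokens.reshape(24, 24, -1)
pooled = grid.reshape(12, 2, 12, 2, -1).mean(axis=(1, 3)).reshape(144, -1)
print(pooled.shape)  # (144, 4096)
```

Note that the token count enters the LLM's sequence length directly, which is why pooling to 25% of the tokens cuts most of the added compute.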

Huanjin Yao, Wenhao Wu, Taojiannan Yang, YuXin Song, Mengxi Zhang, Haocheng Feng, Yifan Sun, Zhiheng Li, Wanli Ouyang, Jingdong Wang• 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 79.4 | 706 |
| Visual Question Answering | ScienceQA | Accuracy | 69.5 | 370 |
| Science Question Answering | ScienceQA (SQA) | Accuracy | 69.5 | 273 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 204 |
| Visual Question Answering | GQA | Mean Accuracy | 63.8 | 196 |
| Visual Question Answering | GQA (test) | Accuracy | 66.6 | 188 |
| Visual Question Answering | GQA (test-dev) | Accuracy | 62.8 | 184 |
| Hallucination Evaluation | POPE | Accuracy | 86.6 | 153 |
| Multimodal Understanding | MM-VET (test) | Total Score | 59.2 | 120 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy | 25.8 | 103 |

Showing 10 of 25 rows.

Other info

Code: https://github.com/HJYao00/DenseConnector