InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
About
Multimodal Large Language Models (MLLMs) have advanced significantly in recent years. Nevertheless, accurately recognizing and comprehending intricate details within high-resolution images remains a challenge. Although indispensable for developing robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture designed to process images at varying resolutions with low computational overhead, enabling MLLMs to scale to higher resolutions. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computational cost. By combining this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical studies underscore the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Code and models can be found at https://huggingface.co/Infi-MM/infimm-hd
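The core idea described above, partitioning high-resolution visual features into windows and letting language-model tokens attend to them via cross-attention, can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the actual InfiMM-HD implementation; all function names, shapes, and the single-head attention are simplifying assumptions.

```python
import numpy as np

def partition_windows(feat, win):
    """Split an (H, W, C) visual feature map into non-overlapping
    win x win windows -> (num_windows, win*win, C).
    Hypothetical helper; window size and layout are assumptions."""
    H, W, C = feat.shape
    feat = feat.reshape(H // win, win, W // win, win, C)
    return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: text-token queries (Nq, C)
    attend over visual tokens (Nk, C). Projections omitted for brevity."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)      # (Nq, Nk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ keys_values                       # (Nq, C)

# Toy sizes: an 8x8 feature map with 16 channels, 4x4 windows.
H, W, C, win = 8, 8, 16, 4
rng = np.random.default_rng(0)
visual = rng.standard_normal((H, W, C))
windows = partition_windows(visual, win)               # (4, 16, 16)
text_tokens = rng.standard_normal((5, C))              # 5 text tokens
out = cross_attention(text_tokens, windows.reshape(-1, C))
print(windows.shape, out.shape)                        # (4, 16, 16) (5, 16)
```

The windowing keeps the number of visual tokens per attention call bounded as resolution grows, which is the source of the low computational overhead claimed above.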
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 82 | 1165 |
| Visual Question Answering | GQA | Accuracy | 63.5 | 963 |
| Multimodal Evaluation | MME | -- | -- | 557 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 65.5 | 283 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 38.9 | 281 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 167 |
| Visual Question Answering | TextVQA (test) | Accuracy | 70.7 | 124 |
| Visual Question Answering | OCR-VQA (test) | Accuracy | 66 | 77 |
| Multimodal Understanding | MMBench (dev) | -- | -- | 58 |
| Visual Question Answering | SciQA-IMG | Accuracy | 83.6 | 53 |