
InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding

About

Multimodal Large Language Models (MLLMs) have advanced significantly in recent years. Nevertheless, accurately recognizing and comprehending intricate details within high-resolution images remains challenging. Despite being indispensable for building robust MLLMs, this area is underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture designed to process images at varying resolutions with low computational overhead, making it practical to scale MLLMs to higher resolutions. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computation costs. Combining this architecture with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical studies underscore the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Code and models can be found at https://huggingface.co/Infi-MM/infimm-hd
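The abstract names two ingredients for keeping high-resolution inputs cheap: partitioning visual features into windows, and injecting them into the language model via cross-attention rather than as extra input tokens. The exact implementation lives in the linked repository; the following is only a minimal NumPy sketch of those two ideas, with all function names, shapes, and the single-head attention formulation being illustrative assumptions rather than the paper's API.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def window_partition(image_feats, window):
    """Split an (H, W, C) feature map into non-overlapping windows.

    Returns (num_windows, window * window, C). Processing each window
    independently keeps per-window attention cost fixed as resolution grows.
    """
    H, W, C = image_feats.shape
    x = image_feats.reshape(H // window, window, W // window, window, C)
    x = x.transpose(0, 2, 1, 3, 4)          # group the two window axes
    return x.reshape(-1, window * window, C)

def cross_attention(text_states, visual_feats):
    """Single-head cross-attention: text hidden states query visual features.

    text_states: (T, d) language-model hidden states (queries)
    visual_feats: (N, d) flattened windowed visual features (keys/values)
    Returns (T, d) — visual information pulled into the text sequence without
    lengthening it, unlike prepending N visual tokens to the LLM input.
    """
    d = text_states.shape[-1]
    scores = text_states @ visual_feats.T / np.sqrt(d)   # (T, N)
    return softmax(scores, axis=-1) @ visual_feats       # (T, d)

# Toy usage: an 8x8 feature map split into 4x4 windows, then attended to.
feats = np.random.rand(8, 8, 16)
windows = window_partition(feats, window=4)              # (4, 16, 16)
fused = cross_attention(np.random.rand(5, 16), windows.reshape(-1, 16))
```

In the real model the queries, keys, and values would each pass through learned projections and multiple heads; this sketch omits them to keep the window-plus-cross-attention structure visible.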

Haogeng Liu, Quanzeng You, Xiaotian Han, Yiqi Wang, Bohan Zhai, Yongfei Liu, Yunzhe Tao, Huaibo Huang, Ran He, Hongxia Yang • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 82 | 1165 |
| Visual Question Answering | GQA | Accuracy | 63.5 | 963 |
| Multimodal Evaluation | MME | -- | -- | 557 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 65.5 | 283 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 38.9 | 281 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 167 |
| Visual Question Answering | TextVQA (test) | Accuracy | 70.7 | 124 |
| Visual Question Answering | OCR-VQA (test) | Accuracy | 66 | 77 |
| Multimodal Understanding | MMBench (dev) | -- | -- | 58 |
| Visual Question Answering | SciQA-IMG | Accuracy | 83.6 | 53 |

Showing 10 of 14 rows.
