InfiMM-HD: A Leap Forward in High-Resolution Multimodal Understanding
About
Multimodal Large Language Models (MLLMs) have advanced significantly in recent years. Nevertheless, accurately recognizing and comprehending intricate details within high-resolution images remains a challenge. Although indispensable for developing robust MLLMs, this area remains underinvestigated. To tackle this challenge, our work introduces InfiMM-HD, a novel architecture designed to process images at varying resolutions with low computational overhead, enabling MLLMs to scale to higher resolutions. InfiMM-HD incorporates a cross-attention module and visual windows to reduce computational cost. By combining this architectural design with a four-stage training pipeline, our model attains improved visual perception efficiently and cost-effectively. Empirical studies underscore the robustness and effectiveness of InfiMM-HD, opening new avenues for exploration in related areas. Code and models can be found at https://huggingface.co/Infi-MM/infimm-hd
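The core idea described above, partitioning high-resolution visual features into windows and letting language-model tokens attend to them via cross-attention, can be sketched as follows. This is a minimal illustrative sketch in NumPy, not the actual InfiMM-HD implementation; all function names, shapes, and the single-head attention are simplifying assumptions.

```python
import numpy as np

def partition_windows(feat, win):
    """Split an (H, W, C) visual feature map into non-overlapping
    win x win windows -> (num_windows, win*win, C).
    Hypothetical helper; window size and layout are assumptions."""
    H, W, C = feat.shape
    feat = feat.reshape(H // win, win, W // win, win, C)
    return feat.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: text-token queries (Nq, C)
    attend over visual tokens (Nk, C). Projections omitted for brevity."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)      # (Nq, Nk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ keys_values                       # (Nq, C)

# Toy sizes: an 8x8 feature map with 16 channels, 4x4 windows.
H, W, C, win = 8, 8, 16, 4
rng = np.random.default_rng(0)
visual = rng.standard_normal((H, W, C))
windows = partition_windows(visual, win)               # (4, 16, 16)
text_tokens = rng.standard_normal((5, C))              # 5 text tokens
out = cross_attention(text_tokens, windows.reshape(-1, C))
print(windows.shape, out.shape)                        # (4, 16, 16) (5, 16)
```

The windowing keeps the number of visual tokens per attention call bounded as resolution grows, which is the source of the low computational overhead claimed above.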
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 82 | 1165 |
| Visual Question Answering | GQA | Accuracy | 63.5 | 963 |
| Multimodal Evaluation | MME | -- | -- | 557 |
| Visual Question Answering | OKVQA | Top-1 Accuracy | 65.5 | 283 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 38.9 | 281 |
| Multi-discipline Multimodal Understanding | MMMU (val) | -- | -- | 167 |
| Visual Question Answering | TextVQA (test) | Accuracy | 70.7 | 124 |
| Visual Question Answering | OCR-VQA (test) | Accuracy | 66 | 77 |
| Multimodal Understanding | MMBench (dev) | -- | -- | 58 |
| Visual Question Answering | SciQA-IMG | Accuracy | 83.6 | 53 |