NVLM: Open Frontier-Class Multimodal LLMs
About
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we release the model weights at https://huggingface.co/nvidia/NVLM-D-72B and will open-source the training code for the community soon.
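The abstract names a 1-D tile-tagging design but does not describe its format. As a rough illustration only, the sketch below assumes that a high-resolution image is split into tiles, that each tile is flattened into a token sequence, and that a text tag marking the tile's position is placed before each sequence. The function name `tag_tiles_1d`, the tag strings `<tile_k>` and `<tile_global>`, and the choice to put the thumbnail first are hypothetical, not taken from the released model.

```python
from typing import List, Optional, Sequence


def tag_tiles_1d(
    tile_tokens: Sequence[Sequence[str]],
    thumbnail_tokens: Optional[Sequence[str]] = None,
) -> List[str]:
    """Flatten per-tile token sequences into one sequence with 1-D position tags.

    Assumption: a tag is a plain text marker inserted before each tile's
    tokens. The tag names used here are illustrative placeholders.
    """
    sequence: List[str] = []

    # Optional low-resolution view of the whole image (placement is assumed).
    if thumbnail_tokens is not None:
        sequence.append("<tile_global>")
        sequence.extend(thumbnail_tokens)

    # Tiles are numbered in a single running (1-D) order, starting at 1.
    for index, tokens in enumerate(tile_tokens, start=1):
        sequence.append(f"<tile_{index}>")
        sequence.extend(tokens)

    return sequence


if __name__ == "__main__":
    tiles = [["a", "b"], ["c"]]
    print(tag_tiles_1d(tiles, thumbnail_tokens=["g"]))
```

In this sketch the tags carry only a running tile index, so the language model sees which tile each token run came from without any 2-D grid coordinates.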
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MM-Vet | MM-Vet Score | 58.9 | 418 |
| Mathematical Reasoning | MathVista | Score | 63.9 | 322 |
| Multimodal Capability Evaluation | MM-Vet | Score | 58.9 | 282 |
| Multimodal Understanding | MMMU | Accuracy | 60.8 | 275 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 59.7 | 266 |
| Chart Question Answering | ChartQA | Accuracy | 86 | 229 |
| Multimodal Understanding | MMStar | Accuracy | 63.7 | 197 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 85.2 | 196 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 66.6 | 189 |
| Diagram Understanding | AI2D | Accuracy | 80.1 | 167 |