# NVLM: Open Frontier-Class Multimodal LLMs

## About
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training.

In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks.

Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures.

Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities.

To advance research in the field, we release the model weights at https://huggingface.co/nvidia/NVLM-D-72B and will open-source the training code for the community soon.
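The 1-D tile-tagging idea can be sketched in a few lines: a high-resolution image is split into tiles, each tile is encoded into tokens, and a text tag identifying the tile's 1-D index is inserted before that tile's tokens so the LLM knows which tokens belong to which tile. The tag strings and function name below are illustrative assumptions for exposition, not NVLM's exact implementation.

```python
def tag_tiles(tile_token_seqs, thumbnail_tokens=None):
    """Prepend a 1-D index tag to each image tile's token sequence.

    tile_token_seqs: list of per-tile token lists, in raster order.
    thumbnail_tokens: optional tokens for a downscaled global view of
    the whole image, tagged separately. Tag strings are hypothetical.
    """
    tagged = []
    if thumbnail_tokens is not None:
        # The global thumbnail gives the model a low-resolution overview.
        tagged.append("<tile_global_thumbnail>")
        tagged.extend(thumbnail_tokens)
    for i, tokens in enumerate(tile_token_seqs, start=1):
        # A plain-text tag marks where tile i's tokens begin.
        tagged.append(f"<tile_{i}>")
        tagged.extend(tokens)
    return tagged


# Example: two tiles of two tokens each, plus a one-token thumbnail.
seq = tag_tiles([["a", "b"], ["c", "d"]], thumbnail_tokens=["t"])
```

Because the tags are ordinary text tokens, this design works with any tile-based dynamic high-resolution scheme without changing the vision encoder.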
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MM-Vet | MM-Vet Score | 58.9 | 531 |
| Multimodal Understanding | MMMU | Accuracy | 60.8 | 437 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 58.9 | 431 |
| Mathematical Reasoning | MathVista | Score | 63.9 | 385 |
| Chart Question Answering | ChartQA | Accuracy | 86.0 | 356 |
| Multimodal Capability Evaluation | MM-Vet | Score | 58.9 | 345 |
| Multimodal Understanding | MMStar | Accuracy | 63.7 | 324 |
| Multi-discipline Multimodal Understanding | MMMU | Accuracy | 59.7 | 317 |
| Visual Mathematical Reasoning | MathVista | Accuracy | 66.6 | 278 |
| Text-based Visual Question Answering | TextVQA (val) | Accuracy | 82.1 | 262 |