
NVLM: Open Frontier-Class Multimodal LLMs

About

We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs) that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary models (e.g., GPT-4o) and open-access models (e.g., Llama 3-V 405B and InternVL 2). Remarkably, NVLM 1.0 shows improved text-only performance over its LLM backbone after multimodal training. In terms of model design, we perform a comprehensive comparison between decoder-only multimodal LLMs (e.g., LLaVA) and cross-attention-based models (e.g., Flamingo). Based on the strengths and weaknesses of both approaches, we propose a novel architecture that enhances both training efficiency and multimodal reasoning capabilities. Furthermore, we introduce a 1-D tile-tagging design for tile-based dynamic high-resolution images, which significantly boosts performance on multimodal reasoning and OCR-related tasks. Regarding training data, we meticulously curate and provide detailed information on our multimodal pretraining and supervised fine-tuning datasets. Our findings indicate that dataset quality and task diversity are more important than scale, even during the pretraining phase, across all architectures. Notably, we develop production-grade multimodality for the NVLM-1.0 models, enabling them to excel in vision-language tasks while maintaining and even improving text-only performance compared to their LLM backbones. To achieve this, we craft and integrate a high-quality text-only dataset into multimodal training, alongside a substantial amount of multimodal math and reasoning data, leading to enhanced math and coding capabilities across modalities. To advance research in the field, we release the model weights at https://huggingface.co/nvidia/NVLM-D-72B and will open-source the training code for the community soon.
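As a rough illustration of the 1-D tile-tagging idea described in the abstract, the sketch below interleaves text tile tags with placeholder slots for each tile's visual tokens. The tag strings, token count, and function name are assumptions made for illustration, not the released implementation.

```python
# Hypothetical sketch of 1-D tile tagging for dynamic high-resolution input.
# Tag names and token counts are illustrative, not the released implementation.
from typing import List

def build_tile_tagged_sequence(num_tiles: int, tokens_per_tile: int = 256) -> List[str]:
    """Interleave 1-D text tile tags with placeholder image-token slots.

    A high-resolution image is split into `num_tiles` local tiles plus one
    global thumbnail; each tile's visual tokens are preceded by a text tag
    so the LLM can tell the tiles apart.
    """
    sequence: List[str] = []
    # Global thumbnail first (tag name is an assumption for illustration).
    sequence.append("<tile_global_thumbnail>")
    sequence.extend(["<image_token>"] * tokens_per_tile)
    # Local tiles, each introduced by its own 1-D tag.
    for i in range(1, num_tiles + 1):
        sequence.append(f"<tile_{i}>")
        sequence.extend(["<image_token>"] * tokens_per_tile)
    return sequence

if __name__ == "__main__":
    seq = build_tile_tagged_sequence(num_tiles=4)
    print(len(seq), seq[:3])  # tags interleaved with image-token placeholders
```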

Wenliang Dai, Nayeon Lee, Boxin Wang, Zhuolin Yang, Zihan Liu, Jon Barker, Tuomas Rintamaki, Mohammad Shoeybi, Bryan Catanzaro, Wei Ping • 2024
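
The released weights at the Hugging Face URL above can be loaded with the Transformers library; the minimal sketch below assumes the checkpoint ships custom modeling code (hence trust_remote_code) and multi-GPU sharding via device_map. Consult the model card for the exact preprocessing and generation setup.

```python
# Minimal loading sketch for the released NVLM-D-72B checkpoint.
from transformers import AutoModel, AutoTokenizer

model_id = "nvidia/NVLM-D-72B"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModel.from_pretrained(
    model_id,
    torch_dtype="auto",      # keep the checkpoint's native precision
    device_map="auto",       # shard the 72B model across available GPUs
    trust_remote_code=True,  # the repo provides custom modeling code
).eval()
```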

Related benchmarks

Task | Dataset | Metric | Result | Rank
Multimodal Understanding | MM-Vet | MM-Vet Score | 58.9 | 531
Multimodal Understanding | MMMU | Accuracy | 60.8 | 437
Multimodal Reasoning | MM-Vet | MM-Vet Score | 58.9 | 431
Mathematical Reasoning | MathVista | Score | 63.9 | 385
Chart Question Answering | ChartQA | Accuracy | 86 | 356
Multimodal Capability Evaluation | MM-Vet | Score | 58.9 | 345
Multimodal Understanding | MMStar | Accuracy | 63.7 | 324
Multi-discipline Multimodal Understanding | MMMU | Accuracy | 59.7 | 317
Visual Mathematical Reasoning | MathVista | Accuracy | 66.6 | 278
Text-based Visual Question Answering | TextVQA (val) | Accuracy | 82.1 | 262

Showing 10 of 41 rows.
