An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models

About

Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs are performed using models with 13B parameters or fewer. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share our findings from our explorations in image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that the performance of LoRA/QLoRA tuning of LMMs is comparable to that of full-model fine-tuning. Additionally, the study highlights the importance of higher image resolutions and of mixing multimodal-language data for improving LMM performance, and finds that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope that this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.

Yadong Lu, Chunyuan Li, Haotian Liu, Jianwei Yang, Jianfeng Gao, Yelong Shen • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Multimodal Understanding | MM-VET (test) | Total Score 36.4 | 114 |
| Vision-Language Understanding | MM-Vet | Total Score 35.5 | 43 |
| Language Understanding | MMLU v1 (test) | Accuracy 65.1 | 15 |
| Language Instruction Following | Vicuna-80 v1 (test) | Score 85.3 | 10 |
| Open-ended Visual Chat | LLaVA-Bench In-the-Wild (full) | Reasoning Score 88.7 | 8 |
| Multimodal Understanding | LLaVA-Bench v1 (test) | Score 74.2 | 6 |
| Multimodal Understanding | MM-VET v1 (test) | Score 36.4 | 6 |