An Empirical Study of Scaling Instruct-Tuned Large Multimodal Models
About
Visual instruction tuning has recently shown encouraging progress with open-source large multimodal models (LMMs) such as LLaVA and MiniGPT-4. However, most existing studies of open-source LMMs use models with 13B parameters or fewer. In this paper we present an empirical study of scaling LLaVA up to 33B and 65B/70B, and share findings from our explorations of image resolution, data mixing, and parameter-efficient training methods such as LoRA/QLoRA. These are evaluated by their impact on multimodal and language capabilities when completing real-world tasks in the wild. We find that scaling LMMs consistently enhances model performance and improves language capabilities, and that LoRA/QLoRA tuning of LMMs is comparable in performance to full-model fine-tuning. The study also highlights the importance of higher image resolutions and of mixing multimodal and language-only data for improving LMM performance, and shows that visual instruction tuning can sometimes improve an LMM's pure language capability. We hope this study makes state-of-the-art LMM research at a larger scale more accessible, thus helping establish stronger baselines for future research. Code and checkpoints will be made public.
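For readers unfamiliar with the parameter-efficient tuning the abstract compares against full-model fine-tuning: LoRA freezes the pretrained weight matrix and learns a low-rank update in its place. Below is a minimal NumPy sketch of a LoRA-adapted linear layer; the class name, shapes, and hyperparameter values are illustrative assumptions, not the paper's actual implementation (which tunes LLaMA-based LMMs with standard LoRA/QLoRA tooling).

```python
import numpy as np

class LoRALinear:
    """Hypothetical sketch of a LoRA-adapted linear layer.

    The pretrained weight W is frozen; only the low-rank factors
    A (r x in) and B (out x r) would be trained, adding the update
    (alpha / r) * B @ A on top of W.
    """

    def __init__(self, in_features, out_features, r=8, alpha=16, seed=0):
        rng = np.random.default_rng(seed)
        # Frozen pretrained weight (stand-in values).
        self.W = rng.standard_normal((out_features, in_features)) * 0.02
        # A is randomly initialized; B starts at zero, so the adapted
        # layer initially computes exactly the same output as the base layer.
        self.A = rng.standard_normal((r, in_features)) * 0.01
        self.B = np.zeros((out_features, r))
        self.scale = alpha / r

    def __call__(self, x):
        # y = x W^T + scale * (x A^T) B^T
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(in_features=16, out_features=8, r=4)
x = np.ones((2, 16))
# With B = 0, the adapter contributes nothing yet.
assert np.allclose(layer(x), x @ layer.W.T)
```

Because only `A` and `B` are trained, the number of trainable parameters is `r * (in + out)` per layer rather than `in * out`, which is what makes 33B/65B-scale tuning feasible on modest hardware; QLoRA additionally quantizes the frozen base weights.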
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Multimodal Understanding | MM-VET (test) | Total Score | 36.4 | 114 |
| Vision-Language Understanding | MM-Vet | Total Score | 35.5 | 43 |
| Language Understanding | MMLU v1 (test) | Accuracy | 65.1 | 15 |
| Language Instruction Following | Vicuna-80 v1 (test) | Score | 85.3 | 10 |
| Open-ended Visual Chat | LLaVA-Bench In-the-Wild (full) | Reasoning Score | 88.7 | 8 |
| Multimodal Understanding | LLaVA-Bench v1 (test) | Score | 74.2 | 6 |
| Multimodal Understanding | MM-VET v1 (test) | Score | 36.4 | 6 |