Improved Baselines with Visual Instruction Tuning

About

Large multimodal models (LMM) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
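As a rough illustration of the "MLP projection" connector mentioned above, the sketch below maps per-patch vision features into a language model's embedding space with a two-layer MLP. It is not the authors' implementation: the class name `MLPProjector`, the GELU activation, the random initialisation, and the toy dimensions are all assumptions made for the example. The 576-patch count assumes a 336px input split into 14px patches (24 x 24).

```python
import numpy as np


def gelu(x: np.ndarray) -> np.ndarray:
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))


class MLPProjector:
    """Hypothetical two-layer MLP connector: vision features -> LLM embedding space."""

    def __init__(self, vision_dim: int, llm_dim: int, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        # Randomly initialised weights; a trained model would load learned values.
        self.w1 = rng.normal(0.0, 0.02, size=(vision_dim, llm_dim))
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.normal(0.0, 0.02, size=(llm_dim, llm_dim))
        self.b2 = np.zeros(llm_dim)

    def __call__(self, patch_features: np.ndarray) -> np.ndarray:
        # (num_patches, vision_dim) -> (num_patches, llm_dim)
        hidden = gelu(patch_features @ self.w1 + self.b1)
        return hidden @ self.w2 + self.b2


# Toy sizes so the example runs quickly; real encoders and LLMs are much wider.
# 576 patches assumes a 336px image with 14px patches (an assumption, see above).
num_patches, vision_dim, llm_dim = 576, 64, 128

projector = MLPProjector(vision_dim, llm_dim)
patch_features = np.random.default_rng(1).normal(size=(num_patches, vision_dim))

# Each row is one "visual token" that could be placed alongside text embeddings.
visual_tokens = projector(patch_features)
print(visual_tokens.shape)  # (576, 128)
```

The only design point the sketch is meant to convey is that the connector is fully connected and applied independently to each patch, so the number of visual tokens equals the number of patches.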

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy | 88.8 | 2019 |
| Visual Question Answering | VizWiz | Accuracy | 68.6 | 1820 |
| Visual Question Answering | TextVQA | Accuracy | 61.5 | 1453 |
| Visual Question Answering | VQA v2 | Accuracy | 100 | 1429 |
| Visual Question Answering | GQA | Accuracy | 72.6 | 1425 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 78.2 | 962 |
| Multimodal Understanding | MMBench | Accuracy | 74.4 | 847 |
| Science Question Answering | ScienceQA | Accuracy | 74.1 | 791 |
| Multimodal Evaluation | MME | Score | 1.86e+3 | 727 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 80 | 712 |
Showing 10 of 1239 rows