
Improved Baselines with Visual Instruction Tuning

About

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee • 2023
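The key architectural change described above is replacing LLaVA's single linear vision-language projection with a two-layer MLP. Below is a minimal dependency-free sketch of that idea, not the official LLaVA code: a Linear-GELU-Linear projector that maps one vision patch token into the LLM's embedding space. The tiny dimensions (4 and 6) are stand-ins for the real ones (CLIP-ViT-L features are 1024-dimensional; a 13B LLM hidden size is 5120) so the example runs without deep-learning libraries.

```python
import math
import random

VISION_DIM, LLM_DIM = 4, 6  # stand-ins for 1024 (CLIP-ViT-L) and 5120 (13B LLM)

def gelu(x):
    # tanh approximation of GELU, the usual activation between the two layers
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def linear(x, w, b):
    # x: [in_dim], w: [out_dim][in_dim], b: [out_dim] -> [out_dim]
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]

def mlp_projector(patch_feature, params):
    # Two-layer MLP connector (Linear -> GELU -> Linear); params = (w1, b1, w2, b2).
    w1, b1, w2, b2 = params
    h = linear(patch_feature, w1, b1)
    h = [gelu(v) for v in h]
    return linear(h, w2, b2)

random.seed(0)

def rand_mat(rows, cols):
    return [[random.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

params = (rand_mat(LLM_DIM, VISION_DIM), [0.0] * LLM_DIM,
          rand_mat(LLM_DIM, LLM_DIM), [0.0] * LLM_DIM)

token = [random.uniform(-1, 1) for _ in range(VISION_DIM)]
projected = mlp_projector(token, params)
print(len(projected))  # each vision token is now LLM_DIM-dimensional
```

In practice this projector is the only new module between the frozen-format vision encoder and the language model; the paper's finding is that this simple fully-connected connector, plus better data, is enough for state-of-the-art results.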

Related benchmarks

Task                                  Dataset                  Metric            Result  Rank
Visual Question Answering             VizWiz                   Accuracy          63.11   1525
Object Hallucination Evaluation       POPE                     Accuracy          87.9    1455
Visual Question Answering             VQA v2                   Accuracy          80.3    1362
Visual Question Answering             TextVQA                  Accuracy          61.5    1285
Visual Question Answering             GQA                      Accuracy          72      1249
Text-based Visual Question Answering  TextVQA                  Accuracy          78.2    807
Visual Question Answering             VQA v2 (test-dev)        Overall Accuracy  80      706
Image Captioning                      MS COCO Karpathy (test)  --                --      682
Multimodal Evaluation                 MME                      Score             1860    658
Multimodal Understanding              MMBench                  Accuracy          74.4    637

Showing 10 of 999 rows.
