
Improved Baselines with Visual Instruction Tuning

About

Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state of the art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available data samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and model will be publicly available.
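The MLP projection mentioned above can be sketched as follows. This is a minimal, hypothetical illustration of a two-layer MLP cross-modal connector that maps vision-encoder patch features into the LLM embedding space; the dimensions (1024 for CLIP-ViT-L features, 5120 for a 13B LLM), the GELU activation, and all variable names are assumptions for illustration, not taken from this page.

```python
import numpy as np

VISION_DIM = 1024  # assumed CLIP-ViT-L patch feature width
LLM_DIM = 5120     # assumed embedding width of a 13B LLM

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def mlp_project(patch_features, w1, b1, w2, b2):
    """Two-layer MLP connector: vision patch features -> LLM token embeddings."""
    return gelu(patch_features @ w1 + b1) @ w2 + b2

rng = np.random.default_rng(0)
# A 336x336 image with 14x14 patches yields 24 * 24 = 576 visual tokens.
feats = rng.standard_normal((576, VISION_DIM)).astype(np.float32)
w1 = (rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02).astype(np.float32)
b1 = np.zeros(LLM_DIM, dtype=np.float32)
w2 = (rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02).astype(np.float32)
b2 = np.zeros(LLM_DIM, dtype=np.float32)

tokens = mlp_project(feats, w1, b1, w2, b2)
print(tokens.shape)  # (576, 5120)
```

Each of the 576 projected vectors is then consumed by the LLM as an ordinary input token, which is what makes this fully-connected connector so simple compared to query-based resamplers.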

Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VQA v2 | Accuracy | 80.3 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 61.5 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 63.11 | 1043 |
| Visual Question Answering | GQA | Accuracy | 72 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.8 | 935 |
| Image Captioning | MS COCO Karpathy (test) | -- | -- | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 80 | 664 |
| Multimodal Evaluation | MME | Score | 1860 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 78.2 | 496 |
| Image Classification | Flowers102 | Accuracy | 6.7 | 478 |

Showing 10 of 743 rows.
