Improved Baselines with Visual Instruction Tuning
About
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
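The "MLP projection" mentioned above replaces LLaVA's original single linear layer with a small two-layer MLP that maps vision-encoder patch features into the LLM's token-embedding space. The sketch below illustrates the idea with NumPy; the dimensions (1024-d CLIP-ViT-L features, 5120-d embeddings for a 13B LLaMA-family LLM, 576 patch tokens for a 336px image with 14px patches) are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Two-layer MLP connector: vision features -> LLM embedding space.

    Dimensions are illustrative: CLIP-ViT-L yields 1024-d patch features;
    a 13B LLaMA-family LLM uses 5120-d token embeddings.
    """
    def __init__(self, vision_dim=1024, llm_dim=5120, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((vision_dim, llm_dim)) * 0.02
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
        self.b2 = np.zeros(llm_dim)

    def __call__(self, patch_features):
        # patch_features: (num_patches, vision_dim) -> (num_patches, llm_dim)
        h = gelu(patch_features @ self.w1 + self.b1)
        return h @ self.w2 + self.b2

# A 336px image with 14px patches gives 24 * 24 = 576 patch tokens.
proj = MLPProjector()
tokens = proj(np.zeros((576, 1024)))
print(tokens.shape)  # (576, 5120)
```

Each projected row is then treated as one visual token in the LLM's input sequence, which is why the projector's output dimension must match the LLM's embedding width.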
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee • 2023
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 80.3 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 61.5 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 63.11 | 1043 |
| Visual Question Answering | GQA | Accuracy | 72 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.8 | 935 |
| Image Captioning | MS COCO Karpathy (test) | -- | -- | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 80 | 664 |
| Multimodal Evaluation | MME | Score | 1860 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 78.2 | 496 |
| Image Classification | Flowers102 | Accuracy | 6.7 | 478 |
*Showing 10 of 743 rows.*
...