Improved Baselines with Visual Instruction Tuning
About
Large multimodal models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this note, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely, using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented VQA data with simple response formatting prompts, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B checkpoint uses merely 1.2M publicly available samples, and finishes full training in ~1 day on a single 8-A100 node. We hope this can make state-of-the-art LMM research more accessible. Code and models will be publicly available.
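The "MLP projection" mentioned above replaces LLaVA's original single linear layer with a small two-layer MLP that maps vision-encoder patch features into the LLM's token-embedding space. The sketch below illustrates the idea with NumPy; the dimensions (1024-d CLIP-ViT-L features, 5120-d embeddings for a 13B LLaMA-family LLM, 576 patch tokens for a 336px image with 14px patches) are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU activation
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

class MLPProjector:
    """Two-layer MLP connector: vision features -> LLM embedding space.

    Dimensions are illustrative: CLIP-ViT-L yields 1024-d patch features;
    a 13B LLaMA-family LLM uses 5120-d token embeddings.
    """
    def __init__(self, vision_dim=1024, llm_dim=5120, seed=0):
        rng = np.random.default_rng(seed)
        self.w1 = rng.standard_normal((vision_dim, llm_dim)) * 0.02
        self.b1 = np.zeros(llm_dim)
        self.w2 = rng.standard_normal((llm_dim, llm_dim)) * 0.02
        self.b2 = np.zeros(llm_dim)

    def __call__(self, patch_features):
        # patch_features: (num_patches, vision_dim) -> (num_patches, llm_dim)
        h = gelu(patch_features @ self.w1 + self.b1)
        return h @ self.w2 + self.b2

# A 336px image with 14px patches gives 24 * 24 = 576 patch tokens.
proj = MLPProjector()
tokens = proj(np.zeros((576, 1024)))
print(tokens.shape)  # (576, 5120)
```

Each projected row is then treated as one visual token in the LLM's input sequence, which is why the projector's output dimension must match the LLM's embedding width.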
Haotian Liu, Chunyuan Li, Yuheng Li, Yong Jae Lee • 2023
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 80.3 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 61.5 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 63.11 | 1043 |
| Visual Question Answering | GQA | Accuracy | 72 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 86.8 | 935 |
| Image Captioning | MS COCO Karpathy (test) | -- | -- | 682 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 80 | 664 |
| Multimodal Evaluation | MME | Score | 1860 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 78.2 | 496 |
| Image Classification | Flowers102 | Accuracy | 6.7 | 478 |
*Showing 10 of 743 rows.*
...