To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning

About

Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Despite the promising performance achieved, these descriptions are derived from image annotations, which are often coarse-grained. Moreover, without access to the full visual context, the generated instructions may even contradict the visual content. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instruction data can improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.

Junke Wang, Lingchen Meng, Zejia Weng, Bo He, Zuxuan Wu, Yu-Gang Jiang• 2023
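The core idea, prompting GPT-4V with the image itself rather than with textual annotations, can be sketched as a request payload in the OpenAI chat-completions vision format. This is an illustrative sketch only: the prompt wording, model name, and helper function are assumptions, not the authors' exact pipeline.

```python
import base64
import json

def build_gpt4v_request(image_bytes: bytes, model: str = "gpt-4-vision-preview") -> dict:
    """Build a chat-completions payload that asks GPT-4V to write
    instruction-answer pairs grounded in the image content.
    The prompt text below is a hypothetical example, not the paper's prompt."""
    b64 = base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "text",
                        "text": (
                            "Look carefully at the image and write several "
                            "question-answer pairs that can only be answered "
                            "from the visual content."
                        ),
                    },
                    {
                        # Images are passed inline as a base64 data URL.
                        "type": "image_url",
                        "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
                    },
                ],
            }
        ],
        "max_tokens": 1024,
    }

# Placeholder bytes stand in for a real LVIS image file.
payload = build_gpt4v_request(b"\xff\xd8\xff\xe0 fake jpeg bytes")
print(json.dumps(payload)[:60])
```

Because the image accompanies the prompt, the generated instructions are conditioned on the actual visual context rather than on possibly coarse annotations, which is the contrast the abstract draws with annotation-driven pipelines.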

Related benchmarks

Task                        Dataset                  Metric            Result   Rank
Visual Question Answering   VizWiz                   Accuracy          57.2     1525
Visual Question Answering   VQA v2                   Accuracy          80.7     1362
Visual Question Answering   TextVQA                  Accuracy          62.5     1285
Visual Question Answering   GQA                      Accuracy          63.8     1249
Visual Question Answering   VQA v2 (test-dev)        Overall Accuracy  80.7     706
Multimodal Perception       MME Perception           --                --       79
Science Question Answering  ScienceQA IMG (test)     Accuracy          70.6     74
Multimodal Understanding    MMBench (dev)            Accuracy          68       58
Multimodal Cognition        MME Cognition            Cognition Score   286.8    34
Visual Question Answering   GQA balanced (test-dev)  Accuracy          63.6     32

(10 of 13 rows shown.)
