To See is to Believe: Prompting GPT-4V for Better Visual Instruction Tuning
About
Existing visual instruction tuning methods typically prompt large language models with textual descriptions to generate instruction-following data. Although these methods perform well, the descriptions are derived from image annotations, which are often coarse-grained. Moreover, because the language model never observes the full visual context, the generated instructions may even contradict the visual content. To address this challenge, we introduce a fine-grained visual instruction dataset, LVIS-Instruct4V, which contains 220K visually aligned and context-aware instructions produced by prompting the powerful GPT-4V with images from LVIS. Through experimental validation and case studies, we demonstrate that high-quality visual instructional data can improve the performance of LLaVA-1.5, a state-of-the-art large multimodal model, across a wide spectrum of benchmarks by clear margins. Notably, by simply replacing LLaVA-Instruct with our LVIS-Instruct4V, we achieve better results than LLaVA on most challenging LMM benchmarks, e.g., LLaVA$^w$ (76.7 vs. 70.7) and MM-Vet (40.2 vs. 35.4). We release our data and model at https://github.com/X2FD/LVIS-INSTRUCT4V.
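To put the two headline comparisons in the abstract on a common footing, the short sketch below computes the absolute and relative gains from the quoted scores only. The dictionary keys and the helper name `gains` are illustrative choices, not part of the released code, and no other numbers are assumed.

```python
# Scores quoted in the abstract: (with LVIS-Instruct4V, LLaVA baseline).
# Only these two pairs are stated there; nothing else is assumed.
reported = {
    "LLaVA^w": (76.7, 70.7),
    "MM-Vet": (40.2, 35.4),
}


def gains(ours: float, baseline: float) -> tuple[float, float]:
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = ours - baseline
    relative = 100.0 * absolute / baseline
    # Round to one decimal, matching the precision of the reported scores.
    return round(absolute, 1), round(relative, 1)


# Hypothetical helper output: benchmark name -> (absolute, relative) gain.
summary = {name: gains(*scores) for name, scores in reported.items()}

if __name__ == "__main__":
    for name, (abs_gain, rel_gain) in summary.items():
        print(f"{name}: +{abs_gain} points ({rel_gain}% relative)")
```

The absolute gains are 6.0 points on LLaVA$^w$ and 4.8 points on MM-Vet; the relative figures depend on each benchmark's scale and are best read as a rough guide.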
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy: 80.7 | 1165 |
| Visual Question Answering | TextVQA | Accuracy: 62.5 | 1117 |
| Visual Question Answering | VizWiz | Accuracy: 57.2 | 1043 |
| Visual Question Answering | GQA | Accuracy: 63.8 | 963 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 80.7 | 664 |
| Multimodal Perception | MME Perception | -- | 61 |
| Multimodal Understanding | MMBench (dev) | Accuracy: 68 | 58 |
| Science Question Answering | ScienceQA IMG (test) | Accuracy: 70.6 | 45 |
| Multimodal Cognition | MME Cognition | Cognition Score: 286.8 | 34 |
| Visual Question Answering | GQA balanced (test-dev) | Accuracy: 63.6 | 32 |