InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
About
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
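The key architectural idea, an instruction-aware Query Transformer, can be illustrated with a minimal sketch: learnable query tokens first mix with the instruction tokens via self-attention, then extract instruction-conditioned features from the image via cross-attention. This is a simplified single-head, single-layer approximation with made-up shapes, not the official BLIP-2/InstructBLIP Q-Former implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, k, v):
    # scaled dot-product attention
    d = q.shape[-1]
    return softmax(q @ k.T / np.sqrt(d)) @ v

def instruction_aware_query(queries, instr_emb, img_feats):
    """Sketch of instruction-aware querying: queries and instruction
    tokens interact via self-attention over their concatenation, then
    the queries cross-attend to the image features."""
    n_q = queries.shape[0]
    x = np.concatenate([queries, instr_emb], axis=0)
    x = attention(x, x, x)          # queries absorb instruction context
    q = x[:n_q]                     # keep only the query tokens
    return attention(q, img_feats, img_feats)  # instruction-conditioned visual features

# hypothetical sizes: 8 query tokens, 5 instruction tokens, 32 image patches, dim 64
rng = np.random.default_rng(0)
out = instruction_aware_query(rng.normal(size=(8, 64)),
                              rng.normal(size=(5, 64)),
                              rng.normal(size=(32, 64)))
print(out.shape)  # (8, 64)
```

Because the queries see the instruction before attending to the image, the same image yields different extracted features for different instructions, which is the property the paper credits for its zero-shot gains.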
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 78.9 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 50.7 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 34.5 | 1043 |
| Visual Question Answering | GQA | Accuracy | 50.7 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 85.3 | 935 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 0.824 | 682 |
| Multimodal Evaluation | MME | Score | 1300 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 50.7 | 496 |
| Video Question Answering | MSRVTT-QA | Accuracy | 41.8 | 481 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 36 | 418 |