
SVIT: Scaling up Visual Instruction Tuning

About

Thanks to the emergence of foundation models, large language and vision models have been integrated to acquire multimodal abilities such as visual captioning and question answering. Although existing multimodal models show impressive performance in visual understanding and reasoning, their limits remain largely under-explored due to the scarcity of high-quality instruction tuning data. To push the limits of multimodal capability, we Scale up Visual Instruction Tuning (SVIT) by constructing a dataset of 4.2 million visual instruction tuning samples, including 1.6M conversation question-answer (QA) pairs, 1.6M complex reasoning QA pairs, 1.0M referring QA pairs, and 106K detailed image descriptions. Beyond its volume, the proposed dataset also features high quality and rich diversity; it is generated by prompting GPT-4 with abundant manual annotations of images. We also propose a new data recipe for selecting a subset with better diversity and balance, which evokes the model's superior capabilities. Extensive experiments verify that SVIT-v1.5, trained on the proposed dataset, outperforms state-of-the-art Multimodal Large Language Models on popular benchmarks. The data and code are publicly available at https://github.com/BAAI-DCAI/Visual-Instruction-Tuning.

Bo Zhao, Boya Wu, Muyang He, Tiejun Huang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 80.3 | 664 |
| Multimodal Understanding | MMMU (val) | MMMU Score: 38 | 111 |
| Multimodal Understanding | MMMU (test) | MMMU Score: 34.1 | 86 |
| Multimodal Understanding | MMBench (test) | -- | 65 |
| Multimodal Perception | MME Perception | -- | 61 |
| Science Question Answering | ScienceQA IMG (test) | Accuracy: 70 | 45 |
| Multimodal Cognition | MME Cognition | Cognition Score: 323.2 | 34 |
| Visual Question Answering | GQA balanced (test-dev) | Accuracy: 64.1 | 32 |
| Multimodal Understanding | SEED-Bench 1 | -- | 15 |
