ShareGPT4V: Improving Large Multi-Modal Models with Better Captions

About

In the realm of large multi-modal models (LMMs), efficient modality alignment is crucial yet often constrained by the scarcity of high-quality image-text data. To address this bottleneck, we introduce the ShareGPT4V dataset, a large-scale resource featuring 1.2 million highly descriptive captions, which surpasses existing datasets in diversity and information content, covering world knowledge, object properties, spatial relationships, and aesthetic evaluations. Specifically, ShareGPT4V originates from a curated set of 100K high-quality captions collected from GPT4-Vision and has been expanded to 1.2M with a caption model trained on this subset. We first demonstrate the effectiveness of ShareGPT4V in the Supervised Fine-Tuning (SFT) phase: substituting an equivalent quantity of detailed captions in existing SFT datasets with a subset of our high-quality captions significantly improves LMMs such as LLaVA-7B, LLaVA-1.5-13B, and Qwen-VL-Chat-7B on the MME and MMBench benchmarks, with respective gains of 222.8/22.0/22.3 and 2.7/1.3/1.5. We further incorporate ShareGPT4V data into both the pre-training and SFT phases, obtaining ShareGPT4V-7B, an LMM based on a simple architecture that performs strongly across a majority of the multi-modal benchmarks. This project is available at https://ShareGPT4V.github.io as a resource for the LMM community.
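
The substitution step described in the abstract (swapping the detailed captions in an existing SFT dataset for an equal number of higher-quality ones) can be sketched roughly as follows. This is an illustrative sketch only, not the authors' pipeline: the function name `substitute_captions`, the `is_detailed_caption` predicate, and the dictionary-based sample layout are all assumptions made for the example.

```python
import random


def substitute_captions(sft_samples, high_quality_captions, is_detailed_caption, seed=0):
    """Swap detailed-caption samples for an equal number of high-quality ones.

    All names and the sample layout are hypothetical; the abstract only states
    that an equivalent quantity of detailed captions is substituted.
    """
    rng = random.Random(seed)

    # Keep every sample that is not a detailed caption (e.g. VQA or chat data).
    kept = [s for s in sft_samples if not is_detailed_caption(s)]
    n_to_replace = len(sft_samples) - len(kept)

    if n_to_replace > len(high_quality_captions):
        raise ValueError("not enough high-quality captions to substitute one-for-one")

    # Draw the same number of replacement captions, without repetition.
    replacements = rng.sample(high_quality_captions, n_to_replace)

    # The dataset size is unchanged; only the caption portion differs.
    mixed = kept + replacements
    rng.shuffle(mixed)
    return mixed
```

Called with a list of SFT samples, a pool of replacement captions, and a predicate such as `lambda s: s["kind"] == "detailed_caption"`, the function returns a dataset of the same size in which the caption portion has been swapped one-for-one, so any change in benchmark results can be attributed to caption quality and not to data volume.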

Lin Chen, Jinsong Li, Xiaoyi Dong, Pan Zhang, Conghui He, Jiaqi Wang, Feng Zhao, Dahua Lin • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 86.6 | 2019 |
| Visual Question Answering | VizWiz | Accuracy | 57.2 | 1820 |
| Visual Question Answering | GQA | Accuracy | 63.3 | 1425 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 60.4 | 962 |
| Multimodal Understanding | MMBench | Accuracy | 68.8 | 847 |
| Multimodal Evaluation | MME | Score | 1.94e+3 | 727 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 81 | 712 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 67.5 | 631 |
| Multimodal Reasoning | MM-Vet | MM-Vet Score | 45.9 | 517 |
| Multimodal Understanding | SEED-Bench | - | - | 516 |
Showing 10 of 152 rows
