xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

About

This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base models and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single-image and multi-image benchmarks. The resulting LMMs demonstrate competitive performance among open-source LMMs of similar size and can comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.

Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Shaoyen Tseng, Gustavo A Lujan-Moreno, Matthew L Olson, Musashi Hinck, David Cobbley, Vasudev Lal, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy 87 | 935 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy 81.5 | 337 |
| Visual Question Answering | TextVQA (val) | VQA Score 71 | 309 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 266 |
| Science Question Answering | ScienceQA (test) | Average Accuracy 88.3 | 208 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy 41.1 | 167 |
| Multimodal Understanding | SEED | Accuracy 72.2 | 136 |
| Vision Understanding | MMBench | -- | 104 |
| Visual Question Answering | ChartQA (test) | Accuracy 60 | 58 |
| Visual Question Answering | AI2D (test) | Accuracy 74.2 | 54 |
Showing 10 of 23 rows
