xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

About

This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base models and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks. The resulting LMMs demonstrate competitive performance among open-source LMMs of similar size and can comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.

Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Shaoyen Tseng, Gustavo A Lujan-Moreno, Matthew L Olson, Musashi Hinck, David Cobbley, Vasudev Lal, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy: 87 | 1455 |
| Mathematical Reasoning | MathVista | Score: 39.6 | 385 |
| Visual Question Answering | TextVQA (val) | VQA Score: 71 | 343 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy: 81.5 | 337 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 317 |
| Science Question Answering | ScienceQA (test) | Average Accuracy: 88.3 | 245 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy: 41.1 | 204 |
| Multimodal Understanding | SEED | Accuracy: 72.2 | 183 |
| Vision Understanding | MMBench | -- | 141 |
| Mathematical Reasoning | MathVista (testmini) | Accuracy: 39.6 | 103 |
Showing 10 of 25 rows
