xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

About

This paper introduces BLIP-3, an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base models and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks. The resulting LMMs demonstrate competitive performance among open-source LMMs of similar size and can comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.

Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Shaoyen Tseng, Gustavo A Lujan-Moreno, Matthew L Olson, Musashi Hinck, David Cobbley, Vasudev Lal, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt, Zeyuan Chen, Silvio Savarese, Juan Carlos Niebles, Caiming Xiong, Ran Xu · 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Hallucination Evaluation | POPE | Accuracy 87 | 2019 |
| Mathematical Reasoning | MathVista | Score 39.6 | 474 |
| Multimodal Capability Evaluation | MM-Vet | Score 41 | 393 |
| Visual Question Answering | TextVQA (val) | VQA Score 71 | 365 |
| Multi-discipline Multimodal Understanding | MMMU | -- | 363 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy 81.5 | 337 |
| Science Question Answering | ScienceQA (test) | Average Accuracy 88.3 | 273 |
| Multimodal Understanding | SEED | Accuracy 72.2 | 216 |
| Massive Multi-discipline Multimodal Understanding | MMMU | Accuracy 40.1 | 216 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy 41.1 | 212 |
Showing 10 of 37 rows
