xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
About
This paper introduces xGen-MM (BLIP-3), an open framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. We release 4B and 14B models, including both the pre-trained base models and the instruction fine-tuned ones. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks, and demonstrate competitive performance among open-source LMMs of similar size, with the ability to comprehend interleaved image-text inputs. Our training code, models, and all datasets used in this work, including the three large-scale datasets we create and the preprocessed ones, will be open-sourced to better support the research community.
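Since the models are released as open checkpoints, a minimal inference sketch may help illustrate how an instruction-tuned LMM of this kind is typically used. The snippet below assumes the checkpoints follow the standard Hugging Face `transformers` loading convention with custom model code; the model ID, the `<image>` placeholder token, and the prompt format are assumptions for illustration, not confirmed by this page.

```python
# Hedged sketch: loading an instruction-tuned xGen-MM (BLIP-3) checkpoint
# via the generic transformers Auto* classes. The model ID and prompt
# template below are assumed; consult the released model card for the
# exact identifiers and preprocessing.
import torch
from PIL import Image
from transformers import AutoModelForVision2Seq, AutoProcessor

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"  # assumed checkpoint name

# Custom architectures are commonly shipped with trust_remote_code=True.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id, torch_dtype=torch.float16, trust_remote_code=True
).eval()

# Single-image VQA-style query; interleaved image-text inputs would
# interleave several <image> placeholders with text segments.
image = Image.open("example.jpg")
prompt = "<image> Describe this image."  # assumed placeholder convention
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(processor.decode(output_ids[0], skip_special_tokens=True))
```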
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy | 87 | 935 |
| Visual Question Answering | VQA 2.0 (test-dev) | Accuracy | 81.5 | 337 |
| Visual Question Answering | TextVQA (val) | VQA Score | 71 | 309 |
| Multi-discipline Multimodal Understanding | MMMU | -- | -- | 266 |
| Science Question Answering | ScienceQA (test) | Average Accuracy | 88.3 | 208 |
| Multi-discipline Multimodal Understanding | MMMU (val) | Accuracy | 41.1 | 167 |
| Multimodal Understanding | SEED | Accuracy | 72.2 | 136 |
| Vision Understanding | MMBench | -- | -- | 104 |
| Visual Question Answering | ChartQA (test) | Accuracy | 60 | 58 |
| Visual Question Answering | AI2D (test) | Accuracy | 74.2 | 54 |