InternLM-XComposer2: Mastering Free-form Text-Image Composition and Comprehension in Vision-Language Large Model
About
We introduce InternLM-XComposer2, a cutting-edge vision-language model excelling in free-form text-image composition and comprehension. This model goes beyond conventional vision-language understanding, adeptly crafting interleaved text-image content from diverse inputs like outlines, detailed textual specifications, and reference images, enabling highly customizable content creation. InternLM-XComposer2 proposes a Partial LoRA (PLoRA) approach that applies additional LoRA parameters exclusively to image tokens to preserve the integrity of pre-trained language knowledge, striking a balance between precise vision understanding and text composition with literary talent. Experimental results demonstrate the superiority of InternLM-XComposer2 based on InternLM2-7B in producing high-quality long-text multi-modal content and its exceptional vision-language understanding performance across various benchmarks, where it not only significantly outperforms existing multimodal models but also matches or even surpasses GPT-4V and Gemini Pro in certain assessments. This highlights its remarkable proficiency in the realm of multimodal understanding. The InternLM-XComposer2 model series with 7B parameters is publicly available at https://github.com/InternLM/InternLM-XComposer.
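The Partial LoRA idea described above can be illustrated with a small sketch. It assumes the standard LoRA form, in which a frozen weight `W0` is augmented by a low-rank product `B @ A`, and it applies that update only to tokens flagged as image tokens. The function names, the `is_image` mask, and the toy matrices are hypothetical and chosen for illustration; biases are omitted, and this is not the released implementation.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def plora_linear(tokens, is_image, W0, A, B):
    """Shared linear layer with a low-rank update applied to image tokens only.

    tokens:   list of input vectors, each of length d_in
    is_image: list of bools, True where the token comes from the image
    W0:       frozen pre-trained weight, shape (d_out, d_in)
    A:        low-rank down-projection, shape (r, d_in)   (assumed LoRA form)
    B:        low-rank up-projection, shape (d_out, r)    (assumed LoRA form)
    """
    outputs = []
    for x, img in zip(tokens, is_image):
        y = matvec(W0, x)  # every token goes through the frozen weight
        if img:
            # Image tokens additionally receive the low-rank correction B(Ax).
            delta = matvec(B, matvec(A, x))
            y = [yi + di for yi, di in zip(y, delta)]
        outputs.append(y)
    return outputs


# Toy example: d_in = d_out = 2, rank r = 1.
W0 = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]
B = [[0.5], [-0.5]]
tokens = [[1.0, 2.0], [3.0, 4.0]]
is_image = [False, True]  # first token is text, second is an image token

out = plora_linear(tokens, is_image, W0, A, B)
print(out)  # [[1.0, 2.0], [6.5, 0.5]]
```

In this toy run the text token passes through `W0` unchanged, which mirrors the stated goal of preserving pre-trained language knowledge, while the image token picks up the extra low-rank term.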
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Text-based Visual Question Answering | TextVQA | Accuracy | 62.2 | 962 |
| Multimodal Understanding | MMBench | Accuracy | 79.6 | 847 |
| Science Question Answering | ScienceQA | Accuracy | 78.3 | 791 |
| Multimodal Evaluation | MME | -- | -- | 727 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 51.2 | 631 |
| Multimodal Understanding | SEED-Bench | Accuracy | 68.9 | 516 |
| Mathematical Reasoning | MathVista | Score | 59.5 | 474 |
| Multimodal Understanding | MMMU | Accuracy | 56.48 | 437 |
| Multimodal Understanding | MMStar | Accuracy | 47.7 | 407 |
| Diagram Question Answering | AI2D | AI2D Accuracy | 80.27 | 387 |