InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks
About
The exponential growth of large language models (LLMs) has opened up numerous possibilities for multimodal AGI systems. However, progress in vision and vision-language foundation models, which are also critical elements of multi-modal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL), which scales up the vision foundation model to 6 billion parameters and progressively aligns it with the LLM, using web-scale image-text data from various sources. This model can be broadly applied to, and achieves state-of-the-art performance on, 32 generic visual-linguistic benchmarks, including visual perception tasks such as image-level or pixel-level recognition, vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval, and multi-modal dialogue systems when linked with LLMs. It has powerful visual capabilities and can be a good alternative to ViT-22B. We hope that our research contributes to the development of multi-modal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.
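The zero-shot classification and retrieval capabilities described above rest on a contrastively aligned embedding space: an image and the text that describes it are encoded into vectors that score high cosine similarity. A minimal sketch of that inference step, using toy NumPy vectors in place of InternVL's actual encoder outputs (the function name and all embeddings here are illustrative, not the repository's API):

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, labels):
    """Rank candidate labels by cosine similarity between one image
    embedding and a matrix of label-prompt text embeddings."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img  # cosine similarity of each label prompt with the image
    order = np.argsort(-sims)
    return [(labels[i], float(sims[i])) for i in order]

# Toy embeddings standing in for encoder outputs: the "cat" prompt shares a
# direction with the image embedding, the distractors are random.
rng = np.random.default_rng(0)
cat_axis = rng.normal(size=512)
image_emb = cat_axis + 0.1 * rng.normal(size=512)   # image of a cat
text_embs = np.stack([
    cat_axis + 0.1 * rng.normal(size=512),          # "a photo of a cat"
    rng.normal(size=512),                           # "a photo of a dog"
    rng.normal(size=512),                           # "a photo of a car"
])
labels = ["cat", "dog", "car"]
ranking = zero_shot_classify(image_emb, text_embs, labels)
print(ranking[0][0])  # top-ranked label
```

In the real pipeline the same similarity ranking drives both directions of zero-shot image-text retrieval: rank texts for a query image, or rank images for a query text.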
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 80.2 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 80.2 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 54.6 | 1043 |
| Visual Question Answering | GQA | Accuracy | 62.9 | 963 |
| Semantic Segmentation | ADE20K | mIoU | 68.64 | 936 |
| Object Hallucination Evaluation | POPE | Accuracy | 91.1 | 935 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 88.2 | 840 |
| Image Classification | ImageNet-1K | Top-1 Accuracy | 83.2 | 836 |
| Image Captioning | MS COCO Karpathy (test) | CIDEr | 1.462 | 682 |
| Image Classification | CIFAR-100 | Top-1 Accuracy | 93.1 | 622 |