VisionLLM v2: An End-to-End Generalist Multimodal Large Language Model for Hundreds of Vision-Language Tasks
About
We present VisionLLM v2, an end-to-end generalist multimodal large language model (MLLM) that unifies visual perception, understanding, and generation within a single framework. Unlike traditional MLLMs, which are limited to text output, VisionLLM v2 significantly broadens its application scope. It excels not only in conventional visual question answering (VQA) but also in open-ended, cross-domain vision tasks such as object localization, pose estimation, and image generation and editing. To this end, we propose a new information transmission mechanism termed "super link" as a medium to connect the MLLM with task-specific decoders. It not only allows flexible transmission of task information and gradient feedback between the MLLM and multiple downstream decoders but also effectively resolves training conflicts in multi-task scenarios. In addition, to support this diverse range of tasks, we carefully collected and organized training data from hundreds of public vision and vision-language tasks. In this way, our model can be jointly trained end-to-end on hundreds of vision-language tasks and generalize to these tasks using a single set of shared parameters driven by different user prompts, achieving performance comparable to task-specific models. We believe VisionLLM v2 will offer a new perspective on the generalization of MLLMs.
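The routing idea behind connecting one MLLM to several task-specific decoders can be illustrated with a small sketch. This is not the authors' implementation: the routing token names (`[DET]`, `[GEN]`), the fixed number of query positions, the `route` function, and the toy decoders are all hypothetical, chosen only to show how hidden states that follow a routing token might be handed to the matching decoder. Gradient feedback is not modeled here, since plain Python lists carry no gradients.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

Vector = List[float]
Decoder = Callable[[Sequence[Vector]], dict]


@dataclass
class LinkedOutput:
    task: str      # routing token that triggered the decoder (hypothetical name)
    result: dict   # whatever the toy decoder returned


def make_toy_decoders() -> Dict[str, Decoder]:
    # Stand-ins for real task decoders; each just reports how many
    # query vectors it received.
    return {
        "[DET]": lambda queries: {"num_box_queries": len(queries)},
        "[GEN]": lambda queries: {"num_image_queries": len(queries)},
    }


def route(
    tokens: Sequence[str],
    hidden_states: Sequence[Vector],
    decoders: Dict[str, Decoder],
    num_queries: int = 4,
) -> List[LinkedOutput]:
    """Send the hidden states that follow each routing token to its decoder.

    Assumption: every routing token is followed by `num_queries` query
    positions whose hidden states carry the task information.
    """
    outputs: List[LinkedOutput] = []
    i = 0
    while i < len(tokens):
        token = tokens[i]
        if token in decoders:
            queries = hidden_states[i + 1 : i + 1 + num_queries]
            outputs.append(LinkedOutput(token, decoders[token](queries)))
            i += 1 + num_queries  # skip past the query positions
        else:
            i += 1  # ordinary text token, nothing to route
    return outputs


# Toy sequence: one text token, a routing token, four query slots, one text token.
tokens = ["Here", "[DET]", "<q>", "<q>", "<q>", "<q>", "."]
hidden_states = [[float(i)] for i in range(len(tokens))]
linked = route(tokens, hidden_states, make_toy_decoders())
```

In a trainable version of this sketch, the query hidden states would be tensors, so a decoder's loss could propagate back into the MLLM through them; that is the "gradient feedback" the abstract refers to, stated here only at the level of detail the abstract gives.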
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP 56.7 | 2454 |
| Visual Question Answering | TextVQA | Accuracy 66.3 | 1117 |
| Visual Question Answering | VizWiz | Accuracy 54.6 | 1043 |
| Semantic Segmentation | ADE20K | mIoU 52.3 | 936 |
| Object Hallucination Evaluation | POPE | Accuracy 87.5 | 935 |
| Object Detection | COCO (val) | mAP 56.7 | 613 |
| Multimodal Evaluation | MME | -- | 557 |
| Visual Question Answering | ScienceQA | Accuracy 94.4 | 210 |
| Multimodal Model Evaluation | MMBench | Accuracy 77.1 | 180 |
| Visual Question Answering | VQAv2 | Accuracy 81.4 | 177 |