Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
About
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of the Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records among generalist models at similar model scales on a broad range of vision-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
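The grounding ability mentioned above surfaces in the model's output as special region tokens: Qwen-VL marks a referred expression with `<ref>...</ref>` followed by a `<box>(x1,y1),(x2,y2)</box>` span whose coordinates are normalized to a 0-999 grid. The following is a minimal sketch (the `parse_grounding` helper and its exact regex are illustrative, not part of the released codebase) of how such output could be parsed back into pixel-space bounding boxes:

```python
import re

# Matches one "<ref>label</ref><box>(x1,y1),(x2,y2)</box>" span.
# The token format follows Qwen-VL's grounded-output convention;
# this parser itself is a hypothetical helper, not official API.
PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_grounding(text: str, width: int, height: int):
    """Return a list of (label, (x1, y1, x2, y2)) boxes in pixel coordinates,
    rescaling from the model's 0-999 normalized grid to the image size."""
    results = []
    for m in PATTERN.finditer(text):
        x1 = int(m["x1"]) * width // 1000
        y1 = int(m["y1"]) * height // 1000
        x2 = int(m["x2"]) * width // 1000
        y2 = int(m["y2"]) * height // 1000
        results.append((m["label"], (x1, y1, x2, y2)))
    return results

boxes = parse_grounding(
    "<ref>the dog</ref><box>(100,200),(500,800)</box>", 640, 480
)
print(boxes)  # [('the dog', (64, 96, 320, 384))]
```

Visual grounding benchmarks (e.g., RefCOCO-style referring expression tasks) can then score these recovered boxes against ground-truth regions with standard IoU.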
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 79.5 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 78.9 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 78.8 | 1043 |
| Visual Question Answering | GQA | Accuracy | 59.3 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 88.1 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 78.8 | 664 |
| Multimodal Evaluation | MME | Score | 1850 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 63.8 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 66.6 | 418 |
| Visual Question Answering | GQA | Accuracy | 59.3 | 374 |