
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

About

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou • 2023
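Since the code, demo, and model weights are released, a minimal usage sketch is included below. It assumes the Hugging Face checkpoint name Qwen/Qwen-VL-Chat and the repository's trust_remote_code chat interface (tokenizer.from_list_format and model.chat); the linked repository remains the authoritative reference for setup and prompts.

```python
# Minimal sketch: asking Qwen-VL-Chat about an image via Hugging Face Transformers.
# Assumes the Qwen/Qwen-VL-Chat checkpoint and the repo's trust_remote_code interface;
# see https://github.com/QwenLM/Qwen-VL for the authoritative instructions.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-VL-Chat", trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen-VL-Chat", device_map="auto", trust_remote_code=True
).eval()

# Interleave an image and a text prompt into a single query string.
query = tokenizer.from_list_format([
    {"image": "demo.jpeg"},  # placeholder local path or URL
    {"text": "Describe the image and read any text in it."},
])
response, history = model.chat(tokenizer, query=query, history=None)
print(response)
```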

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual Question Answering | VizWiz | Accuracy | 78.8 | 1525 |
| Object Hallucination Evaluation | POPE | Accuracy | 88.1 | 1455 |
| Visual Question Answering | VQA v2 | Accuracy | 79.5 | 1362 |
| Visual Question Answering | TextVQA | Accuracy | 78.9 | 1285 |
| Visual Question Answering | GQA | Accuracy | 60.7 | 1249 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 63.8 | 807 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 78.8 | 706 |
| Multimodal Evaluation | MME | Score | 1850 | 658 |
| Multimodal Understanding | MMBench | Accuracy | 77.6 | 637 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 66.6 | 531 |
Showing 10 of 593 rows
...
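As a note on the metric behind the VQA-style "Accuracy" rows above: VQA v2 and benchmarks that follow its protocol score each predicted answer against ten human answers, granting min(#matching annotators / 3, 1) credit. The sketch below shows that core rule only (the official evaluation additionally normalizes answer strings and averages over annotator subsets); the function name and sample answers are illustrative.

```python
def vqa_accuracy(predicted: str, human_answers: list[str]) -> float:
    """Core VQA accuracy rule: full credit once at least 3 of the
    (typically 10) human annotators gave the predicted answer."""
    pred = predicted.strip().lower()
    matches = sum(ans.strip().lower() == pred for ans in human_answers)
    return min(matches / 3.0, 1.0)

# Hypothetical annotations for "How many dogs are there?"
answers = ["2", "2", "two", "2", "2", "3", "two", "2 dogs", "2", "two"]
print(vqa_accuracy("2", answers))  # 1.0  (5 exact matches >= 3)
print(vqa_accuracy("3", answers))  # 0.33 (1 exact match / 3)
```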

Other info

Code

https://github.com/QwenLM/Qwen-VL