
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

About

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by means of a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of the Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models at similar model scales on a broad range of vision-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
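As context for the grounding ability mentioned above: the Qwen-VL repository documents that the model marks referred phrases and their bounding boxes with special tokens, e.g. `<ref>the dog</ref><box>(x1,y1),(x2,y2)</box>`, with coordinates normalized to a 0-999 grid independent of image size. The helper below is our own illustrative sketch (not part of the Qwen-VL codebase) of how such a response could be parsed back into pixel-space boxes, assuming that output format.

```python
import re

# Qwen-VL (per its repo docs) emits grounding spans such as
# "<ref>the dog</ref><box>(123,456),(789,987)</box>", with coordinates
# on a 0-999 normalized grid. This parser is a hypothetical helper,
# not an API shipped with the model.
BOX_PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_grounding(response: str, width: int, height: int):
    """Return a list of (label, (x1, y1, x2, y2)) tuples in pixel coordinates."""
    results = []
    for m in BOX_PATTERN.finditer(response):
        # Rescale from the 0-999 normalized grid to the actual image size.
        x1 = int(m["x1"]) * width // 1000
        y1 = int(m["y1"]) * height // 1000
        x2 = int(m["x2"]) * width // 1000
        y2 = int(m["y2"]) * height // 1000
        results.append((m["label"].strip(), (x1, y1, x2, y2)))
    return results

if __name__ == "__main__":
    reply = "<ref>the dog</ref><box>(100,200),(500,800)</box>"
    print(parse_grounding(reply, width=1000, height=500))
    # -> [('the dog', (100, 100, 500, 400))]
```

A box from the 0-999 grid must be rescaled per axis, since the model never sees the original resolution; integer division keeps the result in valid pixel indices.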

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy: 79.5 | 1165 |
| Visual Question Answering | TextVQA | Accuracy: 78.9 | 1117 |
| Visual Question Answering | VizWiz | Accuracy: 78.8 | 1043 |
| Visual Question Answering | GQA | Accuracy: 59.3 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy: 88.1 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy: 78.8 | 664 |
| Multimodal Evaluation | MME | Score: 1850 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy: 63.8 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score: 66.6 | 418 |
| Visual Question Answering | GQA | Accuracy: 59.3 | 374 |

Showing 10 of 511 rows.
