
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

About

In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both texts and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity by the meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond the conventional image description and question-answering, we implement the grounding and text-reading ability of Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records for generalist models under similar model scales on a broad range of visual-centric benchmarks (e.g., image captioning, question answering, visual grounding) and different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority compared to existing vision-language chatbots. Code, demo and models are available at https://github.com/QwenLM/Qwen-VL.
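The abstract says grounding is learned by aligning image-caption-box tuples but does not spell out how a box becomes model input. The sketch below shows one plausible way to serialize such a tuple into plain text. The `<ref>`/`<box>` tag names, the 1000-bin coordinate grid, and the helper names `GroundedCaption`, `normalize_box`, and `serialize` are all assumptions made for illustration, not details confirmed by this page.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class GroundedCaption:
    """One image-caption-box tuple: a phrase and the pixel box it refers to."""
    phrase: str
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels


def normalize_box(box, width, height, bins=1000):
    """Map pixel coordinates to integers in [0, bins), independent of image size.

    The 1000-bin grid is an assumed convention, not taken from the abstract.
    """
    def scale(value, size):
        # Integer arithmetic avoids float rounding; clamp to the valid range.
        return min(bins - 1, max(0, value * bins // size))

    x1, y1, x2, y2 = box
    return (scale(x1, width), scale(y1, height),
            scale(x2, width), scale(y2, height))


def serialize(item, width, height):
    """Render the tuple as a tagged string (tag names are hypothetical)."""
    x1, y1, x2, y2 = normalize_box(item.box, width, height)
    return f"<ref>{item.phrase}</ref><box>({x1},{y1}),({x2},{y2})</box>"


if __name__ == "__main__":
    sample = GroundedCaption("a dog", (64, 48, 320, 240))
    print(serialize(sample, width=640, height=480))
```

Expressing coordinates as ordinary text on a fixed grid means the language model needs no separate detection head: localization becomes next-token prediction over the same vocabulary used for captions.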

Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, Jingren Zhou • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Hallucination Evaluation | POPE | Accuracy 88.1 | 2019 |
| Visual Question Answering | VizWiz | Accuracy 78.8 | 1820 |
| Visual Question Answering | TextVQA | Accuracy 78.9 | 1453 |
| Visual Question Answering | VQA v2 | Accuracy 79.5 | 1429 |
| Visual Question Answering | GQA | Accuracy 60.7 | 1425 |
| Text-based Visual Question Answering | TextVQA | Accuracy 63.8 | 962 |
| Multimodal Understanding | MMBench | Accuracy 77.6 | 847 |
| Science Question Answering | ScienceQA | Accuracy 68.2 | 791 |
| Multimodal Evaluation | MME | Score 1.85e+3 | 727 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy 78.8 | 712 |
Showing 10 of 732 rows
