Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
About
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from the Qwen-LM as a foundation, we endow it with visual capacity through a meticulously designed (i) visual receptor, (ii) input-output interface, (iii) 3-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image description and question answering, we implement the grounding and text-reading abilities of the Qwen-VLs by aligning image-caption-box tuples. The resulting models, including Qwen-VL and Qwen-VL-Chat, set new records among generalist models at similar model scales on a broad range of vision-centric benchmarks (e.g., image captioning, question answering, visual grounding) and in different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialog benchmarks, our instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
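The grounding ability mentioned above surfaces in the model's output as special region tokens: Qwen-VL marks a referred expression with `<ref>...</ref>` followed by a `<box>(x1,y1),(x2,y2)</box>` span whose coordinates are normalized to a 0-999 grid. The following is a minimal sketch (the `parse_grounding` helper and its exact regex are illustrative, not part of the released codebase) of how such output could be parsed back into pixel-space bounding boxes:

```python
import re

# Matches one "<ref>label</ref><box>(x1,y1),(x2,y2)</box>" span.
# The token format follows Qwen-VL's grounded-output convention;
# this parser itself is a hypothetical helper, not official API.
PATTERN = re.compile(
    r"<ref>(?P<label>.*?)</ref>"
    r"<box>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)</box>"
)

def parse_grounding(text: str, width: int, height: int):
    """Return a list of (label, (x1, y1, x2, y2)) boxes in pixel coordinates,
    rescaling from the model's 0-999 normalized grid to the image size."""
    results = []
    for m in PATTERN.finditer(text):
        x1 = int(m["x1"]) * width // 1000
        y1 = int(m["y1"]) * height // 1000
        x2 = int(m["x2"]) * width // 1000
        y2 = int(m["y2"]) * height // 1000
        results.append((m["label"], (x1, y1, x2, y2)))
    return results

boxes = parse_grounding(
    "<ref>the dog</ref><box>(100,200),(500,800)</box>", 640, 480
)
print(boxes)  # [('the dog', (64, 96, 320, 384))]
```

Visual grounding benchmarks (e.g., RefCOCO-style referring expression tasks) can then score these recovered boxes against ground-truth regions with standard IoU.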
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy | 79.5 | 1165 |
| Visual Question Answering | TextVQA | Accuracy | 78.9 | 1117 |
| Visual Question Answering | VizWiz | Accuracy | 78.8 | 1043 |
| Visual Question Answering | GQA | Accuracy | 59.3 | 963 |
| Object Hallucination Evaluation | POPE | Accuracy | 88.1 | 935 |
| Visual Question Answering | VQA v2 (test-dev) | Overall Accuracy | 78.8 | 664 |
| Multimodal Evaluation | MME | Score | 1850 | 557 |
| Text-based Visual Question Answering | TextVQA | Accuracy | 63.8 | 496 |
| Multimodal Understanding | MM-Vet | MM-Vet Score | 66.6 | 418 |
| Visual Question Answering | GQA | Accuracy | 59.3 | 374 |