VisionLLM: Large Language Model is also an Open-Ended Decoder for Vision-Centric Tasks
About
Large language models (LLMs) have notably accelerated progress towards artificial general intelligence (AGI), with their impressive zero-shot capacity for user-tailored tasks endowing them with immense potential across a range of applications. However, in the field of computer vision, despite the availability of numerous powerful vision foundation models (VFMs), they are still restricted to tasks in a pre-defined form, struggling to match the open-ended task capabilities of LLMs.

In this work, we present an LLM-based framework for vision-centric tasks, termed VisionLLM. This framework provides a unified perspective for vision and language tasks by treating images as a foreign language and aligning vision-centric tasks with language tasks that can be flexibly defined and managed using language instructions. An LLM-based decoder can then make appropriate predictions based on these instructions for open-ended tasks.

Extensive experiments show that the proposed VisionLLM can achieve different levels of task customization through language instructions, from fine-grained object-level to coarse-grained task-level customization, all with good results. Notably, with a generalist LLM-based framework, our model achieves over 60% mAP on COCO, on par with detection-specific models. We hope this model can set a new baseline for generalist vision and language models. A demo will be released based on https://github.com/OpenGVLab/InternGPT, and the code will be released at https://github.com/OpenGVLab/VisionLLM.
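To make the "vision task as language task" idea concrete, here is a minimal sketch of how a detection task could be phrased as a language instruction and the decoder's text output parsed back into structured predictions. The prompt template, the `<image>` placeholder, and the `(class, x1, y1, x2, y2)` output format below are illustrative assumptions for this sketch, not the exact templates or tokenization used by VisionLLM.

```python
import re

def build_instruction(class_names):
    """Compose a language instruction that casts object detection as a
    text-generation task. (Illustrative template, not VisionLLM's own.)"""
    return (
        f"For each object in the <image> that belongs to {class_names}, "
        "output a tuple (class name, x1, y1, x2, y2), with coordinates "
        "normalized to the range [0, 1000)."
    )

def parse_boxes(response):
    """Parse '(name, x1, y1, x2, y2)' tuples from the decoder's text output
    into (class, box) predictions."""
    pattern = r"\((\w[\w ]*?),\s*(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\)"
    return [
        (name, int(x1), int(y1), int(x2), int(y2))
        for name, x1, y1, x2, y2 in re.findall(pattern, response)
    ]

prompt = build_instruction(["person", "dog"])
# A hypothetical decoder response, parsed back into structured boxes:
boxes = parse_boxes("(person, 120, 40, 380, 620) (dog, 400, 500, 700, 900)")
```

Because both the task definition and the output format live in the instruction text, the same decoder can be redirected to other tasks (e.g. segmentation or referring expression comprehension) simply by changing the prompt, which is the open-ended flexibility the framework targets.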
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 60.2 | 2643 |
| Instance Segmentation | COCO 2017 (val) | -- | -- | 1201 |
| Object Detection | COCO (val) | mAP | 44.6 | 633 |
| Instance Segmentation | COCO (val) | AP (mask) | 25.1 | 475 |
| Referring Expression Comprehension | RefCOCO+ (val) | -- | -- | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy | 86.7 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy | 86.7 | 342 |
| Object Detection | COCO 2017 | AP (box) | 44.8 | 321 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy | 82.9 | 300 |
| Referring Expression Comprehension | RefCOCOg (val) | Accuracy | 82.9 | 300 |