GLIPv2: Unifying Localization and Vision-Language Understanding
About
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding, a VL reformulation of the detection task; region-word contrastive learning, a novel region-word-level contrastive learning task; and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefit between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
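To illustrate the region-word contrastive idea, here is a minimal NumPy sketch: region features and word features are compared by cosine similarity, and each region is trained to pick out its matching word via softmax cross-entropy. The function names, shapes, and temperature value are illustrative assumptions for this sketch, not GLIPv2's actual implementation.

```python
import numpy as np

def region_word_alignment(region_feats, word_feats, temperature=0.07):
    """Cosine-similarity alignment scores between regions and words.

    region_feats: (R, d) array of region embeddings.
    word_feats:   (W, d) array of word embeddings.
    Returns an (R, W) score matrix, scaled by a temperature
    (0.07 is a common choice in contrastive learning, assumed here).
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    return (r @ w.T) / temperature

def region_word_contrastive_loss(scores, targets):
    """Softmax cross-entropy over words for each region.

    targets[i] is the index of the word that grounds region i.
    """
    # Numerically stable log-softmax over the word axis.
    logits = scores - scores.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: regions that exactly match their words yield a lower
# loss than a shuffled (mismatched) assignment.
rng = np.random.default_rng(0)
regions = rng.standard_normal((3, 8))
words = regions.copy()                      # perfectly aligned pairs
scores = region_word_alignment(regions, words)
loss_matched = region_word_contrastive_loss(scores, np.array([0, 1, 2]))
loss_shuffled = region_word_contrastive_loss(scores, np.array([1, 2, 0]))
```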
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 58.8 | 2454 |
| Object Detection | COCO (test-dev) | mAP | 63.4 | 1195 |
| Object Detection | COCO (val) | -- | -- | 613 |
| Object Detection | COCO 2017 (test-dev) | mAP | 62.4 | 499 |
| Object Detection | LVIS (minival) | AP | 59.8 | 127 |
| Object Detection | ODinW-13 | AP | 70.4 | 98 |
| Object Detection | LVIS mini (val) | mAP | 59.8 | 86 |
| Object Detection | COCO | AP (bbox) | 60.6 | 59 |
| Object Detection | LVIS | APr | 45.8 | 59 |
| Object Detection | ODinW-35 | AP | 22.3 | 59 |