GLIPv2: Unifying Localization and Vision-Language Understanding
About
We present GLIPv2, a grounded VL understanding model that serves both localization tasks (e.g., object detection, instance segmentation) and Vision-Language (VL) understanding tasks (e.g., VQA, image captioning). GLIPv2 elegantly unifies localization pre-training and Vision-Language Pre-training (VLP) with three pre-training tasks: phrase grounding, a VL reformulation of the detection task; region-word contrastive learning, a novel region-word-level contrastive learning task; and masked language modeling. This unification not only simplifies the previous multi-stage VLP procedure but also achieves mutual benefit between localization and understanding tasks. Experimental results show that a single GLIPv2 model (with all model weights shared) achieves near-SoTA performance on various localization and understanding tasks. The model also shows (1) strong zero-shot and few-shot adaptation performance on open-vocabulary object detection tasks and (2) superior grounding capability on VL understanding tasks. Code will be released at https://github.com/microsoft/GLIP.
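To illustrate the region-word contrastive idea, here is a minimal NumPy sketch: region features and word features are compared by cosine similarity, and each region is trained to pick out its matching word via softmax cross-entropy. The function names, shapes, and temperature value are illustrative assumptions for this sketch, not GLIPv2's actual implementation.

```python
import numpy as np

def region_word_alignment(region_feats, word_feats, temperature=0.07):
    """Cosine-similarity alignment scores between regions and words.

    region_feats: (R, d) array of region embeddings.
    word_feats:   (W, d) array of word embeddings.
    Returns an (R, W) score matrix, scaled by a temperature
    (0.07 is a common choice in contrastive learning, assumed here).
    """
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    w = word_feats / np.linalg.norm(word_feats, axis=1, keepdims=True)
    return (r @ w.T) / temperature

def region_word_contrastive_loss(scores, targets):
    """Softmax cross-entropy over words for each region.

    targets[i] is the index of the word that grounds region i.
    """
    # Numerically stable log-softmax over the word axis.
    logits = scores - scores.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

# Toy usage: regions that exactly match their words yield a lower
# loss than a shuffled (mismatched) assignment.
rng = np.random.default_rng(0)
regions = rng.standard_normal((3, 8))
words = regions.copy()                      # perfectly aligned pairs
scores = region_word_alignment(regions, words)
loss_matched = region_word_contrastive_loss(scores, np.array([0, 1, 2]))
loss_shuffled = region_word_contrastive_loss(scores, np.array([1, 2, 0]))
```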
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 58.8 | 2454 |
| Object Detection | COCO (test-dev) | mAP | 63.4 | 1195 |
| Object Detection | COCO (val) | -- | -- | 613 |
| Object Detection | COCO 2017 (test-dev) | mAP | 62.4 | 499 |
| Object Detection | LVIS (minival) | AP | 59.8 | 127 |
| Object Detection | ODinW-13 | AP | 70.4 | 98 |
| Object Detection | LVIS mini (val) | mAP | 59.8 | 86 |
| Object Detection | COCO | AP (bbox) | 60.6 | 59 |
| Object Detection | LVIS | APr | 45.8 | 59 |
| Object Detection | ODinW-35 | AP | 22.3 | 59 |