
DetCLIPv2: Scalable Open-Vocabulary Object Detection Pre-training via Word-Region Alignment

About

This paper presents DetCLIPv2, an efficient and scalable training framework that incorporates large-scale image-text pairs to achieve open-vocabulary object detection (OVD). Unlike previous OVD frameworks that typically rely on a pre-trained vision-language model (e.g., CLIP) or exploit image-text pairs via a pseudo labeling process, DetCLIPv2 directly learns the fine-grained word-region alignment from massive image-text pairs in an end-to-end manner. To accomplish this, we employ a maximum word-region similarity between region proposals and textual words to guide the contrastive objective. To enable the model to gain localization capability while learning broad concepts, DetCLIPv2 is trained with a hybrid supervision from detection, grounding and image-text pair data under a unified data formulation. By jointly training with an alternating scheme and adopting low-resolution input for image-text pairs, DetCLIPv2 exploits image-text pair data efficiently and effectively: DetCLIPv2 utilizes 13X more image-text pairs than DetCLIP with a similar training time and improves performance. With 13M image-text pairs for pre-training, DetCLIPv2 demonstrates superior open-vocabulary detection performance, e.g., DetCLIPv2 with Swin-T backbone achieves 40.4% zero-shot AP on the LVIS benchmark, which outperforms previous works GLIP/GLIPv2/DetCLIP by 14.4/11.4/4.5% AP, respectively, and even beats its fully-supervised counterpart by a large margin.

Lewei Yao, Jianhua Han, Xiaodan Liang, Dan Xu, Wei Zhang, Zhenguo Li, Hang Xu • 2023
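The abstract states that a maximum word-region similarity guides the contrastive objective but does not spell out the computation. The following is a minimal sketch of one way such an objective could look, not the authors' implementation. Several details are assumptions made only for illustration: the function names, averaging the per-word maxima into one image-text score, the one-directional (text-to-image) InfoNCE form, and the temperature of 0.07.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def image_text_similarity(region_feats, word_feats):
    """Score one image against one caption.

    For every word, keep the cosine similarity of its best-matching
    region (the "maximum word-region similarity"), then average over
    words. The averaging step is an assumption of this sketch.
    """
    regions = [normalize(r) for r in region_feats]
    words = [normalize(w) for w in word_feats]
    best_per_word = [max(dot(w, r) for r in regions) for w in words]
    return sum(best_per_word) / len(best_per_word)


def contrastive_loss(batch_regions, batch_words, temperature=0.07):
    """Text-to-image InfoNCE loss; pair i of the batch is the positive.

    batch_regions[j] holds the region features of image j and
    batch_words[i] holds the word features of caption i.
    """
    n = len(batch_regions)
    total = 0.0
    for i in range(n):
        logits = [
            image_text_similarity(batch_regions[j], batch_words[i]) / temperature
            for j in range(n)
        ]
        # Numerically stable log-sum-exp over all images in the batch.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_z - logits[i]
    return total / n
```

In this sketch a caption only needs each of its words to match some region of the paired image, so no box annotations are required for the image-text data; the loss is low when captions score highest against their own images and high when the pairing is shuffled.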

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Object Detection | LVIS v1.0 (val) | APbbox | 36.6 | 518 |
| Object Detection | LVIS (val) | mAP | 53.1 | 141 |
| Object Detection | LVIS (minival) | AP | 44.7 | 127 |
| Object Detection | ODinW-13 | AP | 70.4 | 98 |
| Object Detection | LVIS mini (val) | mAP | 60.1 | 86 |
| Object Detection | COCO | AP (bbox) | 44.7 | 59 |
| Open-vocabulary object detection | LVIS v1 (val) | AP_r^b | 33.3 | 54 |
| Object Detection | ODinW 13 datasets (test) | AP | 70.4 | 28 |
| Object Detection | LVIS 1.0 (minival) | AP | 60.1 | 26 |
| Object Detection | LVIS (minival5k) | AP | 44.7 | 18 |
Showing 10 of 16 rows
