PromptDet: Towards Open-vocabulary Detection using Uncurated Images
About
The goal of this work is to establish a scalable pipeline for expanding an object detector towards novel/unseen categories, using zero manual annotations. To achieve that, we make the following four contributions: (i) in pursuit of generalisation, we propose a two-stage open-vocabulary object detector, where the class-agnostic object proposals are classified with a text encoder from a pre-trained visual-language model; (ii) to pair the visual latent space (of RPN box proposals) with that of the pre-trained text encoder, we propose the idea of regional prompt learning to align the textual embedding space with regional visual object features; (iii) to scale up the learning procedure towards detecting a wider spectrum of objects, we exploit the available online resources via a novel self-training framework, which allows the proposed detector to be trained on a large corpus of noisy, uncurated web images; and (iv) to evaluate our proposed detector, termed PromptDet, we conduct extensive experiments on the challenging LVIS and MS-COCO datasets. PromptDet shows superior performance over existing approaches with fewer additional training images and zero manual annotations whatsoever. Project page with code: https://fcjian.github.io/promptdet.
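The classification step in contribution (i) can be illustrated with a minimal sketch. This is not the PromptDet implementation: the function name `classify_regions`, the `temperature` value, and the toy embeddings are all assumptions made for illustration. It only shows the general idea of scoring class-agnostic region features against per-category text embeddings by cosine similarity.

```python
import numpy as np


def classify_regions(region_feats, text_embeds, temperature=0.01):
    """Score each region feature against each category text embedding.

    Hypothetical helper: region_feats is (num_regions, dim) and
    text_embeds is (num_categories, dim). The temperature is an assumed value.
    """
    # L2-normalise both sets so the dot product is a cosine similarity.
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_embeds / np.linalg.norm(text_embeds, axis=1, keepdims=True)

    # Temperature-scaled similarity logits, shape (num_regions, num_categories).
    logits = (r @ t.T) / temperature

    # Numerically stable softmax over categories.
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return probs.argmax(axis=1), probs


# Toy stand-ins: three orthogonal "category" embeddings in an 8-d space.
text_embeds = np.eye(3, 8)

# Two "region" features: noisy copies of categories 2 and 0.
rng = np.random.default_rng(0)
region_feats = text_embeds[[2, 0]] + 0.05 * rng.normal(size=(2, 8))

labels, probs = classify_regions(region_feats, text_embeds)
print(labels.tolist())  # [2, 0]
```

Because the category set enters only through `text_embeds`, a new category can in principle be added by appending a row for its text embedding, without retraining a fixed classification layer. In a real detector the embeddings would come from the pre-trained text encoder and the region features from the detector's box head, neither of which is modelled here.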
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Object Detection | LVIS v1.0 (val) | APbbox 25.3 | 518 |
| Instance Segmentation | LVIS v1.0 (val) | -- | 189 |
| Object Detection | OV-COCO | AP50 (Novel) 26.6 | 97 |
| Instance Segmentation | LVIS | mAP (Mask) 25.3 | 68 |
| Open-vocabulary object detection | LVIS v1 (val) | AP_r^b 21.4 | 54 |
| Instance Segmentation | LVIS (val) | APr 21.4 | 46 |
| Object Detection | COCO open-vocabulary (test) | Novel AP 26.6 | 25 |
| Open-vocabulary object detection | OV-LVIS | AP Novel 19 | 18 |
| Object Detection | OV-LVIS v1 (val) | AP_mask_novel 19 | 17 |