Simple Open-Vocabulary Object Detection with Vision Transformers
About
Combining simple architectures with large-scale pre-training has led to massive improvements in image classification. For object detection, pre-training and scaling approaches are less well established, especially in the long-tailed and open-vocabulary setting, where training data is relatively scarce. In this paper, we propose a strong recipe for transferring image-text models to open-vocabulary object detection. We use a standard Vision Transformer architecture with minimal modifications, contrastive image-text pre-training, and end-to-end detection fine-tuning. Our analysis of the scaling properties of this setup shows that increasing image-level pre-training and model size yield consistent improvements on the downstream detection task. We provide the adaptation strategies and regularizations needed to attain very strong performance on zero-shot text-conditioned and one-shot image-conditioned object detection. Code and models are available on GitHub.
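The core of zero-shot text-conditioned detection described above is scoring each predicted box's image embedding against embeddings of free-text queries in the shared contrastive space. A minimal sketch of that scoring step follows; all names, shapes, and the `logit_scale` parameter are illustrative assumptions, not the authors' exact implementation.

```python
# Hedged sketch of zero-shot text-conditioned box scoring: class logits
# are scaled cosine similarities between per-box image embeddings and
# text-query embeddings. Names and shapes are assumptions for illustration.
import numpy as np

def zero_shot_class_logits(box_embeds, query_embeds, logit_scale=1.0):
    """Score each predicted box against each text query.

    box_embeds:   (num_boxes, dim)   per-box image embeddings
    query_embeds: (num_queries, dim) embeddings of the text queries
    Returns a (num_boxes, num_queries) array of similarity logits.
    """
    # L2-normalize so the dot product becomes a cosine similarity.
    box_embeds = box_embeds / np.linalg.norm(box_embeds, axis=-1, keepdims=True)
    query_embeds = query_embeds / np.linalg.norm(query_embeds, axis=-1, keepdims=True)
    return logit_scale * box_embeds @ query_embeds.T

# Toy example: 3 predicted boxes, 2 text queries, 4-dim embeddings.
rng = np.random.default_rng(0)
logits = zero_shot_class_logits(rng.normal(size=(3, 4)), rng.normal(size=(2, 4)))
print(logits.shape)  # (3, 2)
```

Because the label set is just a list of query embeddings, swapping the text queries for embeddings of example images yields the one-shot image-conditioned variant with the same scoring code.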
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 43.5 | 2454 |
| Object Detection | LVIS v1.0 (val) | AP (bbox) | 35.3 | 518 |
| Referring Expression Comprehension | RefCOCO (testA) | -- | -- | 333 |
| Referring Expression Comprehension | RefCOCO+ (testA) | -- | -- | 207 |
| Object Detection | LVIS (minival) | AP | 34.6 | 127 |
| Object Detection | ODinW-13 | AP | 40.9 | 98 |
| Object Detection | OV-COCO | AP50 (Novel) | 41.8 | 97 |
| Instance Segmentation | LVIS | mAP (Mask) | 34.7 | 68 |
| Object Detection | LVIS | APr | 31.2 | 59 |
| Object Detection | ODinW-35 | AP | 18.8 | 59 |