CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection
About
Deriving reliable region-word alignment from image-text pairs is critical for learning object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capability. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, by grouping images that mention a shared concept in their captions, objects corresponding to the shared concept should exhibit high co-occurrence within the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet achieves superior performance and compelling scalability in open-vocabulary detection; e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
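The core intuition above can be sketched in a few lines: group images whose captions mention the same concept, then, within the group, pick in each image the region whose features are most similar on average to regions in the other images. The function name, feature shapes, and the mean-similarity scoring rule below are illustrative assumptions for a minimal sketch, not the authors' implementation.

```python
import numpy as np

def discover_co_occurring_regions(region_feats):
    """Toy co-occurring object discovery (hypothetical sketch).

    region_feats: list of (num_regions_i, d) arrays, one per image in a
    group whose captions all mention the same concept (the caption-based
    grouping step is assumed to have happened upstream).

    Returns, for each image, the index of the region most similar on
    average to the regions of the *other* images -- the presumed
    instance of the shared concept.
    """
    # L2-normalize so dot products are cosine similarities.
    feats = [f / np.linalg.norm(f, axis=1, keepdims=True) for f in region_feats]
    picks = []
    for i, fi in enumerate(feats):
        # Regions from every other image in the group.
        others = np.concatenate([f for j, f in enumerate(feats) if j != i])
        # Score each region by its mean cross-image similarity.
        scores = (fi @ others.T).mean(axis=1)
        picks.append(int(scores.argmax()))
    return picks

# Toy group: 3 images, 4 regions each. Region 0 of every image carries
# the shared concept direction; the rest are distinct distractors.
eye = np.eye(16)
group = [np.stack([eye[0], eye[3*i+1], eye[3*i+2], eye[3*i+3]])
         for i in range(3)]
print(discover_co_occurring_regions(group))  # -> [0, 0, 0]
```

A region that genuinely depicts the shared concept agrees with its counterparts in every other image, so its mean similarity dominates; distractor regions only match by chance. The selected regions could then be aligned with the shared concept word as pseudo region-word pairs.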
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Object Detection | COCO (val) | mAP 39.1 | 613 |
| Object Detection | COCO | AP50 (Box) 55.1 | 190 |
| Instance Segmentation | LVIS v1.0 (val) | -- | 189 |
| Object Detection | OV-COCO | AP50 (Novel) 30.6 | 97 |
| Instance Segmentation | LVIS | mAP (Mask) 30.7 | 68 |
| Object Detection | LVIS | APr 24.5 | 59 |
| Object Detection | Objects365 (val) | mAP 14.2 | 48 |
| Instance Segmentation | LVIS (val) | APr 29.4 | 46 |
| Object Detection | Objects365 v1 (val) | AP 14.2 | 30 |