
CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

About

Deriving reliable region-word alignment from image-text pairs is critical for learning object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capability. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, when images that mention a shared concept in their captions are grouped together, objects corresponding to that concept should exhibit high co-occurrence within the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performance and compelling scalability in open-vocabulary detection, e.g., by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
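
The co-occurrence intuition in the abstract can be illustrated with a small NumPy sketch. This is not the CoDet implementation (see the linked repository for that); it is a toy under stated assumptions: each image in a concept group is represented by a matrix of L2-normalised region features, and a region's co-occurrence score is taken to be its best cosine similarity to any region in each other image, averaged over the group. The names `cooccurrence_scores` and `align_regions` are hypothetical and chosen only for this example.

```python
import numpy as np

def cooccurrence_scores(group_feats):
    """Score each region by how strongly it recurs across the group.

    group_feats: list of (n_i, d) arrays, one per image, rows assumed
    L2-normalised so that a dot product is a cosine similarity.
    Returns a list of (n_i,) arrays of scores (assumed scoring rule,
    not taken from the paper).
    """
    if len(group_feats) < 2:
        raise ValueError("need at least two images to measure co-occurrence")
    scores = []
    for i, feats_i in enumerate(group_feats):
        best_per_other_image = []
        for j, feats_j in enumerate(group_feats):
            if i == j:
                continue
            sim = feats_i @ feats_j.T          # (n_i, n_j) cosine similarities
            best_per_other_image.append(sim.max(axis=1))
        # Average the best match over all other images in the group.
        scores.append(np.mean(best_per_other_image, axis=0))
    return scores

def align_regions(group_feats):
    """Pick, per image, the region index with the highest co-occurrence score."""
    return [int(np.argmax(s)) for s in cooccurrence_scores(group_feats)]

# Toy group: three images whose captions share one concept.
# The shared object is the unit vector e0; distractors differ per image.
e = np.eye(4)
toy_group = [
    np.stack([e[1], e[0]]),   # shared object at index 1
    np.stack([e[0], e[2]]),   # shared object at index 0
    np.stack([e[3], e[0]]),   # shared object at index 1
]
print(align_regions(toy_group))  # [1, 0, 1]
```

In this toy setting the selected regions would then be paired with the shared caption word. A real detector would use learned region embeddings and train on the resulting alignments rather than one-hot vectors.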

Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Object Detection | COCO 2017 (val) | -- | 2643 |
| Object Detection | COCO (val) | mAP 39.1 | 633 |
| Object Detection | COCO | AP50 (Box) 55.1 | 237 |
| Instance Segmentation | LVIS v1.0 (val) | -- | 189 |
| Object Detection | OV-COCO | AP50 (Novel) 30.6 | 130 |
| Instance Segmentation | LVIS | mAP (Mask) 30.7 | 81 |
| Object Detection | Objects365 (val) | mAP 14.2 | 60 |
| Object Detection | LVIS | APr 24.5 | 59 |
| Instance Segmentation | LVIS (val) | APr 29.4 | 46 |
| Instance Segmentation | LVIS 1.0 (val) | AP (Mask) 39.2 | 33 |
Showing 10 of 29 rows
