CoDet: Co-Occurrence Guided Region-Word Alignment for Open-Vocabulary Object Detection

About

Deriving reliable region-word alignment from image-text pairs is critical to learning object-level vision-language representations for open-vocabulary object detection. Existing methods typically rely on pre-trained or self-trained vision-language models for alignment, which are prone to limitations in localization accuracy or generalization capabilities. In this paper, we propose CoDet, a novel approach that overcomes the reliance on a pre-aligned vision-language space by reformulating region-word alignment as a co-occurring object discovery problem. Intuitively, when images that mention a shared concept in their captions are grouped together, objects corresponding to that concept should exhibit high co-occurrence within the group. CoDet then leverages visual similarities to discover the co-occurring objects and align them with the shared concept. Extensive experiments demonstrate that CoDet has superior performance and compelling scalability in open-vocabulary detection. For example, by scaling up the visual backbone, CoDet achieves 37.0 $\text{AP}^m_{novel}$ and 44.7 $\text{AP}^m_{all}$ on OV-LVIS, surpassing the previous SoTA by 4.2 $\text{AP}^m_{novel}$ and 9.8 $\text{AP}^m_{all}$. Code is available at https://github.com/CVMI-Lab/CoDet.
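The co-occurrence idea in the abstract can be illustrated with a toy sketch. This is not CoDet's implementation (see the linked repository for that). It assumes hand-made region feature vectors, a whitespace tokenizer, and plain cosine similarity. The helper names `group_by_concept`, `cooccurrence_scores`, and `align_concept` are hypothetical and chosen only for illustration.

```python
import math
from collections import defaultdict


def group_by_concept(captions, vocabulary):
    """Map each concept word to the indices of the captions that mention it."""
    groups = defaultdict(list)
    for idx, caption in enumerate(captions):
        words = set(caption.lower().split())  # naive tokenizer (assumption)
        for concept in vocabulary:
            if concept in words:
                groups[concept].append(idx)
    return dict(groups)


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def cooccurrence_scores(query_regions, support_images):
    """Score each query region by its mean best-match similarity
    over the regions of every support image."""
    scores = []
    for q in query_regions:
        best = [max(cosine(q, r) for r in regions) for regions in support_images]
        scores.append(sum(best) / len(best))
    return scores


def align_concept(region_feats, group, query_idx):
    """Return the index of the query-image region that co-occurs most
    strongly across the other images in the concept group."""
    support = [region_feats[i] for i in group if i != query_idx]
    scores = cooccurrence_scores(region_feats[query_idx], support)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy data (made up): three captioned images, two region features each.
captions = ["a dog on the grass", "a dog near a sofa", "a cat on a sofa"]
region_feats = [
    [[0.0, 1.0, 0.0], [0.9, 0.1, 0.0]],  # image 0: grass-like, dog-like
    [[1.0, 0.0, 0.1], [0.0, 0.0, 1.0]],  # image 1: dog-like, sofa-like
    [[0.0, 0.1, 1.0], [0.6, 0.6, 0.1]],  # image 2: sofa-like, cat-like
]

groups = group_by_concept(captions, ["dog", "sofa"])
dog_region = align_concept(region_feats, groups["dog"], query_idx=0)
sofa_region = align_concept(region_feats, groups["sofa"], query_idx=1)
print(groups, dog_region, sofa_region)
```

In this toy setup, the region selected for each image would be paired with the shared concept word as a pseudo region-word label. With only one support image per group, a distractor region can score almost as high as the true match, so larger concept groups should give a more reliable signal.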

Chuofan Ma, Yi Jiang, Xin Wen, Zehuan Yuan, Xiaojuan Qi • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Object Detection | COCO (val) | mAP 39.1 | 613 |
| Object Detection | COCO | AP50 (Box) 55.1 | 190 |
| Instance Segmentation | LVIS v1.0 (val) | -- | 189 |
| Object Detection | OV-COCO | AP50 (Novel) 30.6 | 97 |
| Instance Segmentation | LVIS | mAP (Mask) 30.7 | 68 |
| Object Detection | LVIS | APr 24.5 | 59 |
| Object Detection | Objects365 (val) | mAP 14.2 | 48 |
| Instance Segmentation | LVIS (val) | APr 29.4 | 46 |
| Object Detection | Objects365 v1 (val) | AP 14.2 | 30 |

Showing 10 of 25 rows
