Grounding DINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection

About

In this paper, we present an open-set object detector, called Grounding DINO, which marries the Transformer-based detector DINO with grounded pre-training and can detect arbitrary objects from human inputs such as category names or referring expressions. The key to open-set object detection is introducing language into a closed-set detector so that it generalizes to open-set concepts. To fuse the language and vision modalities effectively, we conceptually divide a closed-set detector into three phases and propose a tight fusion solution, which includes a feature enhancer, a language-guided query selection, and a cross-modality decoder. While previous works mainly evaluate open-set object detection on novel categories, we propose to also evaluate on referring expression comprehension for objects specified with attributes. Grounding DINO performs remarkably well on all three settings, including benchmarks on COCO, LVIS, ODinW, and RefCOCO/+/g. Grounding DINO achieves 52.5 AP on the COCO detection zero-shot transfer benchmark, i.e., without any training data from COCO. It sets a new record on the ODinW zero-shot benchmark with a mean 26.1 AP. Code will be available at https://github.com/IDEA-Research/GroundingDINO.
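
The abstract names a language-guided query selection step without spelling it out. One plausible reading, sketched here as an assumption rather than as the authors' implementation, is to score each image token by its highest similarity to any text token and keep the top-scoring image tokens as decoder queries. The function name `select_queries`, the plain dot-product similarity, and the toy 2-D features are all illustrative choices that do not come from the text above.

```python
from typing import List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product between two equal-length feature vectors."""
    return sum(x * y for x, y in zip(a, b))


def select_queries(
    image_feats: List[Sequence[float]],
    text_feats: List[Sequence[float]],
    num_queries: int,
) -> List[int]:
    """Return indices of the image tokens most similar to the text prompt.

    Assumption: similarity is a dot product, and an image token's score
    is its best match over all text tokens.
    """
    if not text_feats:
        raise ValueError("text_feats must contain at least one token")
    # Score each image token by its best-matching text token.
    scores = [max(dot(img, txt) for txt in text_feats) for img in image_feats]
    # Rank image tokens from highest to lowest score and keep the top ones.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:num_queries]


# Toy example: four image tokens, two text tokens, keep two queries.
image_feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
text_feats = [[1.0, 0.0], [0.2, 0.1]]
selected = select_queries(image_feats, text_feats, num_queries=2)
```

In this toy input the first and third image tokens align best with the text tokens, so their indices are the ones returned. A real detector would operate on batched tensors and learned features; the sketch only illustrates the selection logic.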

Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, Lei Zhang • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP 63 | 2643 |
| Object Detection | COCO (val) | mAP 48.4 | 633 |
| Object Detection | LVIS v1.0 (val) | APbbox 32.3 | 529 |
| Object Detection | COCO v2017 (test-dev) | mAP 63 | 499 |
| Referring Expression Comprehension | RefCOCO+ (val) | Accuracy 82.8 | 354 |
| Referring Expression Comprehension | RefCOCO (val) | Accuracy 90.6 | 344 |
| Referring Expression Comprehension | RefCOCO (testA) | Accuracy 93.19 | 342 |
| Object Counting | FSC-147 (test) | MAE 59.23 | 322 |
| Object Detection | COCO 2017 | AP (Box) 62.6 | 321 |
| Referring Expression Comprehension | RefCOCOg (test) | Accuracy 87.02 | 300 |
Showing 10 of 258 rows