DINO-X: A Unified Vision Model for Open-World Object Detection and Understanding
About
In this paper, we introduce DINO-X, a unified object-centric vision model developed by IDEA Research that achieves the best open-world object detection performance to date. DINO-X employs the same Transformer-based encoder-decoder architecture as Grounding DINO 1.5 to pursue an object-level representation for open-world object understanding. To make long-tailed object detection easier, DINO-X extends its input options to support text prompts, visual prompts, and customized prompts. Building on these flexible prompt options, we develop a universal object prompt that enables prompt-free open-world detection, making it possible to detect anything in an image without requiring users to provide any prompt.

To strengthen the model's core grounding capability, we constructed Grounding-100M, a large-scale dataset of over 100 million high-quality grounding samples, to advance the model's open-vocabulary detection performance. Pre-training on such a large-scale grounding dataset yields a foundational object-level representation, which enables DINO-X to integrate multiple perception heads and simultaneously support a range of object perception and understanding tasks, including detection, segmentation, pose estimation, object captioning, and object-based QA.

Experimental results demonstrate the superior performance of DINO-X. Specifically, the DINO-X Pro model achieves 56.0 AP, 59.8 AP, and 52.4 AP on the COCO, LVIS-minival, and LVIS-val zero-shot object detection benchmarks, respectively. Notably, it scores 63.3 AP and 56.5 AP on the rare classes of the LVIS-minival and LVIS-val benchmarks, improving on the previous SOTA by 5.8 AP and 5.0 AP. These results underscore its significantly improved capacity for recognizing long-tailed objects.
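To make the three prompt modes concrete, here is a minimal sketch of how a unified prompt interface could dispatch between text, visual, and prompt-free detection. This is illustrative only: the class and function names below are assumptions, not DINO-X's published API.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union

# Hypothetical prompt types mirroring the three input options described
# above (text prompt, visual prompt, universal/prompt-free). All names
# here are illustrative, not the actual DINO-X interface.

@dataclass
class TextPrompt:
    # Open-vocabulary category names, e.g. ["person", "traffic cone"]
    categories: List[str]

@dataclass
class VisualPrompt:
    # Exemplar boxes (x1, y1, x2, y2) drawn on a reference image
    boxes: List[Tuple[float, float, float, float]] = field(default_factory=list)

@dataclass
class UniversalPrompt:
    """Prompt-free mode: detect every object in the image."""
    pass

Prompt = Union[TextPrompt, VisualPrompt, UniversalPrompt]

def detection_mode(prompt: Prompt) -> str:
    """Route a prompt to the detection mode it would trigger."""
    if isinstance(prompt, TextPrompt):
        return f"open-vocabulary detection over {len(prompt.categories)} text categories"
    if isinstance(prompt, VisualPrompt):
        return f"exemplar-based detection from {len(prompt.boxes)} reference boxes"
    return "prompt-free open-world detection (detect anything)"
```

A caller would construct whichever prompt type fits the task, e.g. `detection_mode(UniversalPrompt())` for prompt-free detection of all objects in an image.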
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | LVIS (val) | mAP | 38.4 | 141 |
| Object Detection | LVIS mini (val) | mAP | 44.5 | 94 |
| Counting | CountBench | Accuracy | 82.9 | 82 |
| Instance Segmentation | LVIS | mAP (Mask) | 38.5 | 81 |
| Object Detection | COCO | AP (bbox) | 56.0 | 66 |
| Object Detection | LVIS | -- | -- | 59 |
| Landing Zone Selection | Custom Urban Delivery Dataset (test) | AP | 83.5 | 40 |
| Box Detection | SA-Co | Gold cgF1 | 22.5 | 20 |
| Referring Expression Comprehension | KVG-Bench | Accuracy (Air, Seen Categories) | 43.42 | 17 |
| Counting | PixMo-Count | Accuracy | 85 | 11 |