Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head
About
End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves a 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. Notably, in zero-shot scenarios on COCO and LVIS datasets, OmDet-Turbo achieves performance levels nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art benchmarks on ODinW and OVDEval, boasting an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: \url{https://github.com/om-ai-lab/OmDet}
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Object Detection | COCO (val) | -- | 613 | |
| Object Detection | LVIS (minival) | AP34.7 | 127 | |
| Object Detection | ODinW-13 | AP54.7 | 98 | |
| Object Detection | COCO | AP (bbox)53.4 | 59 | |
| Object Detection | ODinW 35 datasets (test) | Average AP30.1 | 15 | |
| Object Detection | COCO (val) | Recall @ IoU=0.561.02 | 10 |