Real-time Transformer-based Open-Vocabulary Detection with Efficient Fusion Head

About

End-to-end transformer-based detectors (DETRs) have shown exceptional performance in both closed-set and open-vocabulary object detection (OVD) tasks through the integration of language modalities. However, their demanding computational requirements have hindered their practical application in real-time object detection (OD) scenarios. In this paper, we scrutinize the limitations of two leading models in the OVDEval benchmark, OmDet and Grounding-DINO, and introduce OmDet-Turbo. This novel transformer-based real-time OVD model features an innovative Efficient Fusion Head (EFH) module designed to alleviate the bottlenecks observed in OmDet and Grounding-DINO. Notably, OmDet-Turbo-Base achieves 100.2 frames per second (FPS) with TensorRT and language cache techniques applied. In zero-shot scenarios on the COCO and LVIS datasets, OmDet-Turbo achieves performance nearly on par with current state-of-the-art supervised models. Furthermore, it establishes new state-of-the-art results on ODinW and OVDEval, with an AP of 30.1 and an NMS-AP of 26.86, respectively. The practicality of OmDet-Turbo in industrial applications is underscored by its exceptional performance on benchmark datasets and superior inference speed, positioning it as a compelling choice for real-time object detection tasks. Code: https://github.com/om-ai-lab/OmDet
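The abstract mentions a "language cache" without describing it. One plausible reading is that text embeddings for class prompts are computed once and reused across frames, so the text encoder is skipped on repeated prompts. The sketch below illustrates that idea only; it is not taken from the OmDet code base. `LanguageCache`, `toy_encoder`, and the lowercase/strip key normalization are hypothetical names and assumptions introduced for illustration.

```python
from typing import Callable, Dict, List, Tuple

Embedding = Tuple[float, ...]


class LanguageCache:
    """Memoize per-prompt text embeddings (hypothetical helper, not the OmDet API)."""

    def __init__(self, encode: Callable[[str], Embedding]) -> None:
        self._encode = encode
        self._store: Dict[str, Embedding] = {}
        self.misses = 0  # number of times the encoder actually ran

    def get(self, prompt: str) -> Embedding:
        # Assumption: prompts that differ only in case or surrounding
        # whitespace share one cache entry.
        key = prompt.strip().lower()
        if key not in self._store:
            self._store[key] = self._encode(key)
            self.misses += 1
        return self._store[key]

    def get_many(self, prompts: List[str]) -> List[Embedding]:
        return [self.get(p) for p in prompts]


def toy_encoder(text: str) -> Embedding:
    # Stand-in for a real text encoder; returns a small deterministic vector.
    return (float(len(text)), float(sum(map(ord, text)) % 97))


cache = LanguageCache(toy_encoder)
labels = ["person", "dog", "Person "]
frame1 = cache.get_many(labels)  # encoder runs for "person" and "dog" only
frame2 = cache.get_many(labels)  # served entirely from the cache
```

In a video or streaming setting the label set rarely changes between frames, so under this reading only the image branch and the fusion head would run per frame.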

Tiancheng Zhao, Peng Liu, Xuan He, Lu Zhang, Kyusong Lee • 2024

Related benchmarks

Task	Dataset	Result
Object Detection	COCO (val)	--	637
Object Detection	LVIS (minival)	AP 34.7	159
Object Detection	ODinW-13	AP 54.7	98
Object Detection	COCO	AP (bbox) 53.4	66
Object Detection	ODinW 35 datasets (test)	Average AP 30.1	15
Object Detection	COCO (val)	Recall @ IoU=0.5 61.02	10
