LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection

About

In this paper, we present a light-weight detection transformer, LW-DETR, which outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advances: effective training techniques, e.g., an improved loss and pretraining, and interleaved window and global attention to reduce the complexity of the ViT encoder. We improve the ViT encoder by aggregating multi-level feature maps, i.e., the intermediate and final feature maps of the encoder, to form richer feature maps, and we introduce a window-major feature map organization to make the interleaved attention computation more efficient. Experimental results demonstrate that the proposed approach is superior to existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at https://github.com/Atten4Vis/LW-DETR.

Qiang Chen, Xiangbo Su, Xinyu Zhang, Jian Wang, Jiahui Chen, Yunpeng Shen, Chuchu Han, Ziliang Chen, Weixiang Xu, Fanrong Li, Shan Zhang, Kun Yao, Errui Ding, Gang Zhang, Jingdong Wang • 2024
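
The abstract mentions a window-major feature map organization without showing it. The sketch below illustrates one plausible reading, and it is not taken from the LW-DETR code: tokens that belong to the same attention window are stored contiguously, so window attention can take plain slices while global attention uses the whole sequence unchanged. The function names `window_major_order` and `split_windows` are hypothetical, and the 4x4 map with 2x2 windows is only a toy size.

```python
def window_major_order(height, width, window):
    """Return row-major token indices listed in window-major order.

    Hypothetical helper: windows are visited left to right, top to bottom,
    and the tokens inside each window are listed in row-major order.
    """
    if height % window or width % window:
        raise ValueError("height and width must be divisible by window")
    order = []
    for wy in range(0, height, window):          # top row of each window
        for wx in range(0, width, window):       # left column of each window
            for y in range(wy, wy + window):
                for x in range(wx, wx + window):
                    order.append(y * width + x)  # row-major index of (y, x)
    return order


def split_windows(tokens, window):
    """In window-major layout every window is one contiguous slice."""
    size = window * window
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


# Toy 4x4 feature map whose tokens are labeled by their row-major index.
row_major_tokens = list(range(16))
order = window_major_order(4, 4, 2)
window_major_tokens = [row_major_tokens[i] for i in order]

# Window attention would operate on each slice independently;
# global attention would operate on window_major_tokens as a whole.
windows = split_windows(window_major_tokens, 2)
print(windows[0])  # [0, 1, 4, 5]
```

Under this assumption the reordering happens once, and switching between window and global attention layers needs no further permutation of the token sequence.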

Related benchmarks

Task	Dataset	Result
Object Detection	COCO 2017 (val)	AP 58.3	2930
Object Detection	S-DGOD (test)	AP (NC) 49.1	43
Grounding	Breast Ultrasound same-center (val)	F1 Score @ IoU=0.25: 49.32	12
Grounding	Renal Ultrasound cross-center (test)	F1 Score @ IoU=0.25: 64.75	12
Ultrasound Diagnosis	Breast Ultrasound (val)	Overall Accuracy 51.41	8
Ultrasound Diagnosis	breast ultrasound (test)	Overall Accuracy 46.96	8
Grounding	Breast Ultrasound cross-center (test)	F1 Score @ IoU=0.25: 43.3	6
Object Detection	Roboflow100 VL	AP 59.8	6
Visual Grounding	breast ultrasound (test)	F1 Score (IoU=0.25): 43.3	6
Grounding	Renal Ultrasound same-center (val)	F1@0.25: 71.83	6

Showing 10 of 13 rows
