LW-DETR: A Transformer Replacement to YOLO for Real-Time Detection
About
In this paper, we present LW-DETR, a lightweight detection transformer that outperforms YOLOs for real-time object detection. The architecture is a simple stack of a ViT encoder, a projector, and a shallow DETR decoder. Our approach leverages recent advances: effective training techniques, e.g., improved loss and pretraining, and interleaved window and global attention, which reduces the ViT encoder's complexity. We improve the ViT encoder by aggregating multi-level feature maps, i.e., its intermediate and final feature maps, to form richer feature maps, and we introduce a window-major feature map organization that makes interleaved attention computation more efficient. Experimental results demonstrate that the proposed approach is superior to existing real-time detectors, e.g., YOLO and its variants, on COCO and other benchmark datasets. Code and models are available at https://github.com/Atten4Vis/LW-DETR.
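The window-major idea mentioned in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names (`to_window_major`, `from_window_major`, `attention`) are hypothetical, and the attention here omits learned projections, multiple heads, and positional encodings. It only shows the memory layout: once the tokens of each window are stored contiguously, window attention becomes a batched operation over windows, and global attention can run over the same buffer without reordering it back to row-major.

```python
import numpy as np

def to_window_major(feat, win):
    """Reorder a row-major (H, W, C) map so each win x win window is contiguous.

    Returns an array of shape (num_windows, win * win, C).
    """
    H, W, C = feat.shape
    assert H % win == 0 and W % win == 0, "H and W must be multiples of win"
    x = feat.reshape(H // win, win, W // win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (nH, nW, win, win, C)
    return x.reshape(-1, win * win, C)

def from_window_major(windows, H, W):
    """Inverse of to_window_major: back to a row-major (H, W, C) map."""
    _, tokens_per_window, C = windows.shape
    win = int(round(tokens_per_window ** 0.5))
    x = windows.reshape(H // win, W // win, win, win, C)
    x = x.transpose(0, 2, 1, 3, 4)  # (nH, win, nW, win, C)
    return x.reshape(H, W, C)

def attention(tokens):
    """Toy single-head self-attention (no learned weights); tokens: (..., N, C)."""
    scores = tokens @ np.swapaxes(tokens, -1, -2) / np.sqrt(tokens.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens

# Assumed toy sizes: a 4 x 4 feature map with 8 channels and 2 x 2 windows.
feat = np.random.default_rng(0).standard_normal((4, 4, 8))
windows = to_window_major(feat, win=2)

# Window attention: each window attends only within itself (batched over windows).
local_out = attention(windows)

# Global attention: all tokens attend to each other, in the same memory order.
global_out = attention(windows.reshape(1, -1, 8))
```

Because self-attention without positional terms is permutation-equivariant, the global step gives the same per-token result in either layout, so the map only needs to be converted back with `from_window_major` when a row-major layout is required downstream.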
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 58.3 | 2843 |
| Grounding | Breast Ultrasound same-center (val) | F1 Score @ IoU=0.25 | 49.32 | 12 |
| Grounding | Renal Ultrasound cross-center (test) | F1 Score @ IoU=0.25 | 64.75 | 12 |
| Ultrasound Diagnosis | Breast Ultrasound (val) | Overall Accuracy | 51.41 | 8 |
| Ultrasound Diagnosis | breast ultrasound (test) | Overall Accuracy | 46.96 | 8 |
| Grounding | Breast Ultrasound cross-center (test) | F1 Score @ IoU=0.25 | 43.3 | 6 |
| Object Detection | Roboflow100 VL | AP | 59.8 | 6 |
| Visual Grounding | breast ultrasound (test) | F1 Score (IoU=0.25) | 43.3 | 6 |
| Grounding | Renal Ultrasound same-center (val) | F1@0.25 | 71.83 | 6 |
| Diagnosis | Renal Ultrasound cross-center (test) | Accuracy | 69.13 | 4 |