Exploring Plain Vision Transformer Backbones for Object Detection

About

We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
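The two observations above can be illustrated with a small, dependency-free sketch. It is not the Detectron2 implementation. It only computes the spatial sizes a simple feature pyramid would derive from the single backbone feature map, and shows how tokens are grouped into non-overlapping, unshifted windows. The function names, the patch size of 16, and the scale factors (4, 2, 1, 0.5) are assumptions chosen for illustration and are not stated in the abstract.

```python
def simple_pyramid_shapes(image_size, patch_size=16,
                          scale_factors=(4.0, 2.0, 1.0, 0.5)):
    """Spatial sizes of pyramid levels built from one ViT feature map.

    Assumption: the backbone emits a single map at stride `patch_size`,
    and each level is obtained by rescaling that one map (no top-down
    or lateral FPN connections are modelled here).
    """
    height, width = image_size
    base_h, base_w = height // patch_size, width // patch_size
    return [(int(base_h * s), int(base_w * s)) for s in scale_factors]


def window_partition(height, width, window):
    """Group token indices of a height x width grid into square windows.

    Windows do not overlap and are not shifted; attention would be
    computed only among the tokens listed in the same window.
    """
    if height % window or width % window:
        raise ValueError("grid must be divisible by the window size (pad first)")
    windows = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            windows.append([
                (top + dy) * width + (left + dx)  # row-major token index
                for dy in range(window)
                for dx in range(window)
            ])
    return windows


if __name__ == "__main__":
    # Hypothetical 1024 x 1024 input: one stride-16 map, four derived levels.
    print(simple_pyramid_shapes((1024, 1024)))
    # A 4 x 4 token grid split into 2 x 2 windows.
    print(window_partition(4, 4, 2))
```

In this sketch information never crosses a window boundary, which is why the abstract mentions a few cross-window propagation blocks: those blocks are what would let tokens in different windows exchange information.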

Yanghao Li, Hanzi Mao, Ross Girshick, Kaiming He • 2022

Related benchmarks

Task                       Dataset             Result               Rank
Object Detection           COCO 2017 (val)     AP 60.4              2454
Instance Segmentation      COCO 2017 (val)     APm 0.425            1144
Object Detection           COCO (val)          --                   613
Object Detection           LVIS v1.0 (val)     APbbox 53.4          518
Instance Segmentation      COCO (val)          APmk 53.1            472
Oriented Object Detection  DOTA v1.0 (test)    SV 78.52             378
Image Classification       iNaturalist 2018    Top-1 Accuracy 86.8  287
Object Detection           COCO 2017           AP (Box) 51.6        279
Object Detection           MS-COCO 2017 (val)  --                   237
Instance Segmentation      COCO 2017           APm 52               199
Showing 10 of 36 rows
