Exploring Plain Vision Transformer Backbones for Object Detection
About
We explore the plain, non-hierarchical Vision Transformer (ViT) as a backbone network for object detection. This design enables the original ViT architecture to be fine-tuned for object detection without needing to redesign a hierarchical backbone for pre-training. With minimal adaptations for fine-tuning, our plain-backbone detector can achieve competitive results. Surprisingly, we observe: (i) it is sufficient to build a simple feature pyramid from a single-scale feature map (without the common FPN design) and (ii) it is sufficient to use window attention (without shifting) aided with very few cross-window propagation blocks. With plain ViT backbones pre-trained as Masked Autoencoders (MAE), our detector, named ViTDet, can compete with the previous leading methods that were all based on hierarchical backbones, reaching up to 61.3 AP_box on the COCO dataset using only ImageNet-1K pre-training. We hope our study will draw attention to research on plain-backbone detectors. Code for ViTDet is available in Detectron2.
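The two observations above can be illustrated with a small sketch. This is a minimal NumPy illustration, not ViTDet's implementation: the window partitioning shows how a single-scale map is split into non-overlapping windows for window attention, and the pyramid builder mimics the simple feature pyramid's scales {1/4, 1/8, 1/16, 1/32} using plain nearest-neighbor upsampling and average pooling in place of the paper's (de)convolutions. The 64×64×256 feature map, the window size of 16, and the function names are illustrative assumptions.

```python
import numpy as np

def window_partition(x, w):
    """Split an (H, W, C) feature map into non-overlapping w x w windows.

    Returns (num_windows, w, w, C). Assumes H and W are divisible by w
    (in practice the map is padded when they are not).
    """
    H, W, C = x.shape
    x = x.reshape(H // w, w, W // w, w, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, w, w, C)

def simple_feature_pyramid(x, scales=(4.0, 2.0, 1.0, 0.5)):
    """Build multi-scale maps from one single-scale map.

    Stand-in for the paper's simple feature pyramid: the real version
    uses deconvolutions / strided convolutions; here we use nearest
    upsampling (scale >= 1) and average pooling (scale < 1).
    """
    H, W, C = x.shape
    out = []
    for s in scales:
        if s >= 1:
            k = int(s)
            y = np.repeat(np.repeat(x, k, axis=0), k, axis=1)
        else:
            k = int(round(1 / s))
            y = x.reshape(H // k, k, W // k, k, C).mean(axis=(1, 3))
        out.append(y)
    return out

# Hypothetical 1/16-scale backbone output: 64x64 tokens, 256 channels.
feat = np.zeros((64, 64, 256), dtype=np.float32)
wins = window_partition(feat, 16)       # (16, 16, 16, 256): 16 windows
pyramid = simple_feature_pyramid(feat)  # strides 1/4, 1/8, 1/16, 1/32
```

With a 1/16-scale input, the four scale factors yield maps at strides 1/4, 1/8, 1/16, and 1/32 relative to the image, matching the scales a standard FPN would feed to the detection heads, but derived from a single backbone map.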
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 60.4 | 2454 |
| Instance Segmentation | COCO 2017 (val) | APm | 0.425 | 1144 |
| Object Detection | COCO (val) | -- | -- | 613 |
| Object Detection | LVIS v1.0 (val) | APbbox | 53.4 | 518 |
| Instance Segmentation | COCO (val) | APmk | 53.1 | 472 |
| Oriented Object Detection | DOTA v1.0 (test) | SV | 78.52 | 378 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy | 86.8 | 287 |
| Object Detection | COCO 2017 | AP (Box) | 51.6 | 279 |
| Object Detection | MS-COCO 2017 (val) | -- | -- | 237 |
| Instance Segmentation | COCO 2017 | APm | 52 | 199 |