Sparse R-CNN: End-to-End Object Detection with Learnable Proposals
About
We present Sparse R-CNN, a purely sparse method for object detection in images. Existing object detectors rely heavily on dense object candidates, such as $k$ anchor boxes pre-defined at every grid cell of an image feature map of size $H\times W$. In our method, by contrast, a fixed sparse set of $N$ learned object proposals is provided to the object recognition head to perform classification and localization. By replacing $HWk$ (up to hundreds of thousands) hand-designed object candidates with $N$ (e.g., 100) learnable proposals, Sparse R-CNN avoids all effort related to object candidate design and many-to-one label assignment. More importantly, final predictions are output directly, without non-maximum suppression post-processing. Sparse R-CNN demonstrates accuracy, run-time, and training convergence performance on par with well-established detector baselines on the challenging COCO dataset, e.g., achieving 45.0 AP with the standard $3\times$ training schedule and running at 22 fps with a ResNet-50 FPN model. We hope our work will inspire a rethinking of the convention of dense priors in object detectors. The code is available at: https://github.com/PeizeSun/SparseR-CNN.
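To make the $HWk$ versus $N$ comparison concrete, the following is a minimal sketch that counts dense anchor candidates over a multi-level feature pyramid and compares the total with a fixed proposal budget. The input size (800×1333), the strides (8 to 128), and $k = 9$ are illustrative assumptions typical of dense detectors, not values stated in the abstract, and `dense_candidate_count` is a hypothetical helper rather than part of the released code.

```python
import math


def dense_candidate_count(image_h, image_w, strides, k):
    """Sum H * W * k over feature-map levels, where each level's
    H x W grid is the image size divided by that level's stride."""
    total = 0
    for stride in strides:
        feat_h = math.ceil(image_h / stride)
        feat_w = math.ceil(image_w / stride)
        total += feat_h * feat_w * k
    return total


# Assumed, illustrative settings (not taken from the abstract).
IMAGE_H, IMAGE_W = 800, 1333
STRIDES = (8, 16, 32, 64, 128)
K = 9          # anchors per grid cell
N = 100        # number of learnable proposals, the abstract's example value

dense = dense_candidate_count(IMAGE_H, IMAGE_W, STRIDES, K)
print(f"dense candidates: {dense}")          # dense candidates: 200700
print(f"learnable proposals: {N}")           # learnable proposals: 100
print(f"reduction factor: {dense / N:.0f}x")  # reduction factor: 2007x
```

Under these assumptions the dense count is on the order of hundreds of thousands, which matches the scale quoted above, while the sparse set stays fixed at $N$ regardless of image resolution.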
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP 46.4 | 2454 |
| Object Detection | COCO (test-dev) | mAP 51.5 | 1195 |
| Object Detection | MS COCO (test-dev) | mAP@.5 63.5 | 677 |
| Object Detection | COCO (val) | mAP 50.8 | 613 |
| Object Detection | COCO v2017 (test-dev) | mAP 51.5 | 499 |
| Video Object Detection | ImageNet VID (val) | -- | 341 |
| Object Detection | COCO (minival) | mAP 46.4 | 184 |
| Object Detection | MS-COCO (val) | mAP 0.445 | 138 |
| Object Detection | AI-TOD (test) | AP@0.5 38.5 | 88 |
| Pedestrian Detection | CityPersons (val) | -- | 85 |