Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks
About
State-of-the-art object detection networks depend on region proposal algorithms to hypothesize object locations. Advances like SPPnet and Fast R-CNN have reduced the running time of these detection networks, exposing region proposal computation as a bottleneck. In this work, we introduce a Region Proposal Network (RPN) that shares full-image convolutional features with the detection network, thus enabling nearly cost-free region proposals. An RPN is a fully convolutional network that simultaneously predicts object bounds and objectness scores at each position. The RPN is trained end-to-end to generate high-quality region proposals, which are used by Fast R-CNN for detection. We further merge RPN and Fast R-CNN into a single network by sharing their convolutional features. In the recently popular terminology of neural networks with 'attention' mechanisms, the RPN component tells the unified network where to look. For the very deep VGG-16 model, our detection system has a frame rate of 5 fps (including all steps) on a GPU, while achieving state-of-the-art object detection accuracy on the PASCAL VOC 2007, PASCAL VOC 2012, and MS COCO datasets with only 300 proposals per image. In the ILSVRC and COCO 2015 competitions, Faster R-CNN and RPN are the foundations of the 1st-place winning entries in several tracks. Code has been made publicly available.
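The abstract says the RPN predicts a box and an objectness score at each position and that only 300 proposals per image are passed to the detector. The following is a minimal sketch of how such a reduction could work, assuming the scored boxes are already available as plain Python lists. The names `iou` and `select_proposals` are hypothetical, and greedy non-maximum suppression with an IoU threshold of 0.7 is an assumption for illustration; the abstract itself does not state the filtering procedure.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x2 >= x1 and y2 >= y1


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def select_proposals(
    boxes: List[Box],
    scores: List[float],
    top_n: int = 300,      # proposal budget quoted in the abstract
    nms_iou: float = 0.7,  # assumed threshold, not given in the abstract
) -> List[int]:
    """Return indices of up to top_n boxes, highest objectness first,
    greedily skipping boxes that overlap an already kept box too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if len(keep) >= top_n:
            break
        if all(iou(boxes[i], boxes[j]) <= nms_iou for j in keep):
            keep.append(i)
    return keep


if __name__ == "__main__":
    demo_boxes = [(0, 0, 10, 10), (0, 0, 10, 11), (20, 20, 30, 30)]
    demo_scores = [0.9, 0.8, 0.7]
    print(select_proposals(demo_boxes, demo_scores))
```

In the demo, the second box overlaps the first almost entirely and is dropped, while the third box is disjoint from both and is kept. A real pipeline would run this on tensors on the GPU; the pure-Python loop here only illustrates the logic.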
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 46 | 2454 |
| Image Classification | ImageNet (val) | Top-1 Acc | 66.82 | 1206 |
| Object Detection | COCO (test-dev) | mAP | 40.6 | 1195 |
| Object Detection | PASCAL VOC 2007 (test) | mAP | 79.5 | 821 |
| Object Detection | MS COCO (test-dev) | mAP@.5 | 62.7 | 677 |
| Object Detection | COCO (val) | mAP | 44 | 613 |
| Object Detection | LVIS v1.0 (val) | AP (bbox) | 24.1 | 518 |
| Object Detection | COCO v2017 (test-dev) | mAP | 37.2 | 499 |
| Oriented Object Detection | DOTA v1.0 (test) | SV | 78.17 | 378 |
| Video Object Detection | ImageNet VID (val) | mAP (%) | 78.3 | 341 |