D2-Net: A Trainable CNN for Joint Detection and Description of Local Features
About
In this work we address the problem of finding reliable pixel-level correspondences under difficult imaging conditions. We propose an approach where a single convolutional neural network plays a dual role: It is simultaneously a dense feature descriptor and a feature detector. By postponing the detection to a later stage, the obtained keypoints are more stable than their traditional counterparts based on early detection of low-level structures. We show that this model can be trained using pixel correspondences extracted from readily available large-scale SfM reconstructions, without any further annotations. The proposed method obtains state-of-the-art performance on both the difficult Aachen Day-Night localization dataset and the InLoc indoor localization benchmark, as well as competitive performance on other benchmarks for image matching and 3D reconstruction.
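The "detect later" idea can be illustrated with a simplified hard-detection sketch: given a dense feature map, a pixel is kept as a keypoint when it is a spatial local maximum in its strongest channel, and the descriptor is the (L2-normalised) feature vector at that location. This is only a minimal numpy illustration of the detect-after-describe principle; the paper's actual detection is a soft, differentiable version with channel- and spatial-ratio scores, and the function and parameter names below are invented for this sketch.

```python
import numpy as np

def detect_keypoints(dense_features, threshold=0.0):
    """Simplified hard detection on a dense feature map of shape (C, H, W).

    A pixel (y, x) is a keypoint if, in its strongest channel, its response
    is a local maximum over the 3x3 spatial neighbourhood and exceeds
    `threshold`. Returns a list of ((y, x), descriptor) pairs.
    """
    C, H, W = dense_features.shape
    score = dense_features.max(axis=0)           # strongest response per pixel
    best_channel = dense_features.argmax(axis=0) # channel attaining that response
    keypoints = []
    for y in range(1, H - 1):
        for x in range(1, W - 1):
            c = best_channel[y, x]
            patch = dense_features[c, y - 1:y + 2, x - 1:x + 2]
            if score[y, x] > threshold and score[y, x] >= patch.max():
                # the descriptor is simply the feature vector at this location
                desc = dense_features[:, y, x]
                desc = desc / (np.linalg.norm(desc) + 1e-8)
                keypoints.append(((y, x), desc))
    return keypoints
```

Because detection and description read from the same feature map, keypoints correspond to locations where a deep, high-level feature fires strongly, rather than to low-level structures such as corners picked by an early-stage detector.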
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Homography Estimation | HPatches | Overall Accuracy (< 1px) | 16.7 | 59 |
| Image Matching | Kinect 1 | Matching Score (MS) | 0.2 | 38 |
| Image Matching | Simulation | Matching Score (MS) | 11 | 38 |
| Image Matching | Kinect 2 | Matching Score (MS) | 0.23 | 38 |
| Image Matching | DeSurT (833 pairs total) | Matching Score (MS) | 14 | 38 |
| Visual Localization | RobotCar Seasons (night) | Recall (0.25m, 2°) | 20.4 | 35 |
| Homography Estimation | HPatches | AUC @ 3px | 23.2 | 35 |
| Visual Localization | Extended CMU Seasons Urban | Recall (0.25m, 2°) | 94 | 34 |
| Homography Estimation | HPatches (viewpoint) | Accuracy (< 1px) | 3.7 | 27 |
| 3D Triangulation | ETH3D (train) | Accuracy (1cm) | 74.75 | 24 |