DISK: Learning local features with policy gradient
About
Local feature frameworks are difficult to learn in an end-to-end fashion, due to the discreteness inherent to the selection and matching of sparse keypoints. We introduce DISK (DIScrete Keypoints), a novel method that overcomes these obstacles by leveraging principles from Reinforcement Learning (RL), optimizing end-to-end for a high number of correct feature matches. Our simple yet expressive probabilistic model lets us keep the training and inference regimes close, while maintaining good enough convergence properties to reliably train from scratch. Our features can be extracted very densely while remaining discriminative, challenging commonly held assumptions about what constitutes a good keypoint, as showcased in Fig. 1, and deliver state-of-the-art results on three public benchmarks.
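The core idea of optimizing through a discrete selection step can be illustrated with a score-function (REINFORCE) gradient estimator. The sketch below is a toy under stated assumptions, not the authors' implementation: it treats keypoint selection as a single categorical choice among a few candidate locations, and the function names, reward values (+1 for a correct match, -0.25 for an incorrect one), learning rate, and sample counts are all hypothetical choices made for illustration.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reinforce_gradient(logits, rewards, num_samples, rng):
    """Monte Carlo estimate of d E[reward] / d logits.

    Uses the score-function identity:
        grad E[r(k)] = E[ r(k) * grad log p(k) ],
    so no gradient has to flow through the discrete sample k.
    """
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(num_samples):
        # Sample one candidate keypoint index from the categorical policy.
        k = rng.choices(range(len(logits)), weights=probs)[0]
        for j in range(len(logits)):
            # d log softmax(k) / d logit_j = 1[j == k] - p_j
            score = (1.0 if j == k else 0.0) - probs[j]
            grad[j] += rewards[k] * score / num_samples
    return grad


def exact_gradient(logits, rewards):
    """Closed-form gradient of the expected reward, used as a check."""
    probs = softmax(logits)
    expected = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - expected) for p, r in zip(probs, rewards)]


def train(logits, rewards, steps, lr, num_samples, rng):
    """Plain gradient ascent on the expected reward with sampled gradients."""
    logits = list(logits)
    for _ in range(steps):
        grad = reinforce_gradient(logits, rewards, num_samples, rng)
        logits = [x + lr * g for x, g in zip(logits, grad)]
    return logits


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical rewards: candidates 0 and 3 yield correct matches.
    rewards = [1.0, -0.25, -0.25, 1.0]
    trained = train([0.0] * 4, rewards, steps=200, lr=0.5,
                    num_samples=64, rng=rng)
    print([round(p, 3) for p in softmax(trained)])
```

In this toy setting the probability mass moves toward the candidates that yield correct matches, even though the sampling step itself is not differentiable. A full pipeline would apply the same principle to both keypoint selection and matching.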
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic Correspondence | PF-WILLOW | PCK@0.1 (bbox) | 17 | 109 |
| Relative Pose Estimation | MegaDepth 1500 | AUC @ 5° | 54.68 | 104 |
| Relative Pose Estimation | MegaDepth (test) | Pose AUC @5° | 45.31 | 83 |
| Homography Estimation | HPatches | Overall Accuracy (< 1px) | 51.3 | 59 |
| Visual Localization | Aachen Day-Night v1.1 (Night) | Success Rate (0.25m, 2°) | 78 | 58 |
| Pose Estimation | KITTI odometry | AUC5 | 84.14 | 51 |
| Visual Localization | Aachen Day-Night v1.1 (Day) | SR (0.25m, 2°) | 87.3 | 50 |
| Image Matching | Kinect 1 | MS | 0.53 | 38 |
| Image Matching | DeSurT (833 pairs total) | MS Score | 44 | 38 |
| Image Matching | Kinect 2 | Matching Score (MS) | 0.52 | 38 |