Pixel-Perfect Structure-from-Motion with Featuremetric Refinement
About
Finding local features that are repeatable across multiple views is a cornerstone of sparse 3D reconstruction. The classical image matching paradigm detects keypoints per-image once and for all, which can yield poorly-localized features and propagate large errors to the final geometry. In this paper, we refine two key steps of structure-from-motion by a direct alignment of low-level image information from multiple views: we first adjust the initial keypoint locations prior to any geometric estimation, and subsequently refine points and camera poses as a post-processing step. This refinement is robust to large detection noise and appearance changes, as it optimizes a featuremetric error based on dense features predicted by a neural network. This significantly improves the accuracy of camera poses and scene geometry for a wide range of keypoint detectors, challenging viewing conditions, and off-the-shelf deep features. Our system easily scales to large image collections, enabling pixel-perfect crowd-sourced localization at scale. Our code is publicly available at https://github.com/cvg/pixel-perfect-sfm as an add-on to the popular SfM software COLMAP.
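The core idea above, minimizing a featuremetric error by moving a keypoint in a dense feature map until its descriptor matches a reference, can be illustrated with a toy sketch. This is not the paper's implementation (which uses learned CNN features and a Levenberg-Marquardt solver inside COLMAP); it is a minimal stand-in using NumPy, bilinear interpolation, and numerical gradient descent, with all function names (`bilinear_sample`, `refine_keypoint`) being hypothetical:

```python
import numpy as np

def bilinear_sample(fmap, x, y):
    """Bilinearly interpolate a dense feature map of shape (H, W, C)
    at a subpixel location (x, y)."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    wx, wy = x - x0, y - y0
    return ((1 - wx) * (1 - wy) * fmap[y0, x0]
            + wx * (1 - wy) * fmap[y0, x1]
            + (1 - wx) * wy * fmap[y1, x0]
            + wx * wy * fmap[y1, x1])

def refine_keypoint(fmap, xy, reference, steps=30, lr=0.4, eps=0.25):
    """Adjust a keypoint (x, y) to minimize the featuremetric error
    ||F(x, y) - reference||^2 via central-difference gradient descent.
    `reference` plays the role of the feature observed in other views."""
    x, y = xy
    for _ in range(steps):
        def cost(px, py):
            d = bilinear_sample(fmap, px, py) - reference
            return float(d @ d)
        # Numerical gradient of the featuremetric cost.
        gx = (cost(x + eps, y) - cost(x - eps, y)) / (2 * eps)
        gy = (cost(x, y + eps) - cost(x, y - eps)) / (2 * eps)
        x -= lr * gx
        y -= lr * gy
    return x, y
```

On a smooth feature map, a keypoint initialized a pixel or two away from the true location slides to the position whose interpolated feature best matches the reference; the real system applies the same principle jointly over many views and tracks.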
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Pose Estimation | KITTI odometry | AUC5 | 84.34 | 51 |
| Pose Estimation | ScanNet | AUC @ 5 deg | 21.25 | 41 |
| Multi-view pose regression | CO3D v2 | RRA@15 | 33.7 | 31 |
| Camera pose estimation | CO3D v2 | AUC@30 | 30.1 | 29 |
| 3D Triangulation | ETH3D (train) | Accuracy (1cm) | 79.01 | 24 |
| Camera pose estimation | IMC | AUC (3° Threshold) | 0.4519 | 20 |
| Structure-from-Motion | IMC 2021 | AUC (3° Threshold) | 46.3 | 17 |
| Multi-View Camera Pose Estimation | ETH3D | AUC@1° | 0.5435 | 16 |
| Multi-View Camera Pose Estimation | IMC Dataset | AUC @ 3° | 45.19 | 16 |
| Multi-View Camera Pose Estimation | Texture-Poor SfM Dataset | AUC (Threshold 3°) | 20.66 | 16 |