Do We Really Need Scene-specific Pose Encoders?
About
Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn a pose encoding from an image using deep convolutional networks that are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. To test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when provided to a branching fully connected architecture, a trained model can achieve competitive results and even surpass current state-of-the-art pose regressors in some cases. Moreover, we show that for outdoor localization, the proposed architecture is the only pose regressor, to date, consistently localizing in under 2 meters and 5 degrees.
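To make the idea of a branching fully connected regressor concrete, the following is a minimal NumPy sketch of a forward pass over pre-computed encodings. It is an illustration under stated assumptions, not the authors' implementation: the abstract does not specify layer counts, widths, or the orientation parameterization, so the encoding size (2048), the hidden width (256), the single shared layer, the unit-quaternion output, and the name `BranchingPoseRegressor` are all hypothetical choices. The weights are random, so this shows only the data flow, not a trained model.

```python
import numpy as np


def init_linear(rng, n_in, n_out):
    """Return (weights, bias) for one fully connected layer."""
    w = rng.standard_normal((n_in, n_out)) * np.sqrt(2.0 / n_in)
    b = np.zeros(n_out)
    return w, b


def relu(x):
    return np.maximum(x, 0.0)


class BranchingPoseRegressor:
    """Shared FC layer followed by separate position and orientation branches.

    All sizes are assumptions made for illustration only.
    """

    def __init__(self, enc_dim=2048, hidden=256, seed=0):
        rng = np.random.default_rng(seed)
        self.shared = init_linear(rng, enc_dim, hidden)
        # Position branch: hidden layer, then a 3-vector (x, y, z).
        self.pos_hidden = init_linear(rng, hidden, hidden)
        self.pos_out = init_linear(rng, hidden, 3)
        # Orientation branch: hidden layer, then a 4-vector (quaternion).
        self.rot_hidden = init_linear(rng, hidden, hidden)
        self.rot_out = init_linear(rng, hidden, 4)

    @staticmethod
    def _fc(x, layer):
        w, b = layer
        return x @ w + b

    def forward(self, enc):
        """Map encodings of shape (N, enc_dim) to positions (N, 3) and quaternions (N, 4)."""
        h = relu(self._fc(enc, self.shared))
        p = self._fc(relu(self._fc(h, self.pos_hidden)), self.pos_out)
        q = self._fc(relu(self._fc(h, self.rot_hidden)), self.rot_out)
        # Normalize so each row is a unit quaternion (guard against zero norm).
        norm = np.maximum(np.linalg.norm(q, axis=-1, keepdims=True), 1e-12)
        return p, q / norm


if __name__ == "__main__":
    # Stand-in for encodings produced offline by an image retrieval model.
    encodings = np.random.default_rng(1).standard_normal((4, 2048))
    positions, quaternions = BranchingPoseRegressor().forward(encodings)
    print(positions.shape, quaternions.shape)
```

Because the encodings are computed once in advance, only the small fully connected layers would need to be trained per scene in a setup like this; the image encoder itself stays fixed.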
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Camera Localization | 7 Scenes | Average Position Error (m) | 0.23 | 46 |
| Camera Localization | 7-Scenes Chess | Translation Error (m) | 0.13 | 40 |
| Visual Localization | Cambridge Landmarks (test) | Avg Median Positional Error (m) | 1.42 | 35 |
| Camera Pose Regression | 7Scenes Fire | Median Position Error (m) | 0.25 | 26 |
| Camera Pose Regression | 7Scenes Heads | Median Position Error (m) | 0.15 | 26 |
| Camera Pose Regression | 7Scenes Pumpkin | Median Position Error (m) | 0.22 | 26 |
| Camera Pose Regression | 7Scenes | Median Position Error (m) | 0.23 | 26 |
| Camera Pose Regression | 7Scenes (Office) | Median Position Error (m) | 0.24 | 26 |
| Camera Pose Regression | 7Scenes Stairs | Median Position Error (m) | 0.34 | 26 |
| Camera Pose Regression | 7Scenes Kitchen | Median Position Error (m) | 0.3 | 26 |
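
The benchmark rows above report position error in meters, and the abstract also refers to an orientation threshold in degrees. As a hedged sketch of how such figures are commonly computed, the snippet below takes position error as the Euclidean distance between predicted and ground-truth camera positions, and orientation error as the angle between predicted and ground-truth unit quaternions. The function names and the (w, x, y, z) quaternion layout are assumptions for illustration; the exact evaluation protocol behind the numbers in the table is not stated here.

```python
import numpy as np


def position_error_m(t_pred, t_true):
    """Per-frame Euclidean distance between predicted and true positions."""
    diff = np.asarray(t_pred, dtype=float) - np.asarray(t_true, dtype=float)
    return np.linalg.norm(diff, axis=-1)


def orientation_error_deg(q_pred, q_true):
    """Per-frame rotation angle (degrees) between two quaternions.

    Inputs are normalized first; the absolute value of the dot product
    accounts for q and -q representing the same rotation.
    """
    q_pred = np.asarray(q_pred, dtype=float)
    q_true = np.asarray(q_true, dtype=float)
    q_pred = q_pred / np.linalg.norm(q_pred, axis=-1, keepdims=True)
    q_true = q_true / np.linalg.norm(q_true, axis=-1, keepdims=True)
    dot = np.clip(np.abs(np.sum(q_pred * q_true, axis=-1)), 0.0, 1.0)
    return np.degrees(2.0 * np.arccos(dot))


def median_pose_errors(t_pred, t_true, q_pred, q_true):
    """Return (median position error in m, median orientation error in deg)."""
    pos = position_error_m(t_pred, t_true)
    rot = orientation_error_deg(q_pred, q_true)
    return float(np.median(pos)), float(np.median(rot))
```

Under this reading, a scene would count as localized "under 2 meters and 5 degrees" when both returned medians fall below those thresholds.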