Do We Really Need Scene-specific Pose Encoders?

About

Visual pose regression models estimate the camera pose from a query image with a single forward pass. Current models learn a pose encoding from an image using deep convolutional networks which are trained per scene. The resulting encoding is typically passed to a multi-layer perceptron in order to regress the pose. In this work, we propose that scene-specific pose encoders are not required for pose regression and that encodings trained for visual similarity can be used instead. In order to test our hypothesis, we take a shallow architecture of several fully connected layers and train it with pre-computed encodings from a generic image retrieval model. We find that these encodings are not only sufficient to regress the camera pose, but that, when provided to a branching fully connected architecture, a trained model can achieve competitive results and even surpass current state-of-the-art pose regressors in some cases. Moreover, we show that for outdoor localization, the proposed architecture is the only pose regressor, to date, consistently localizing in under 2 meters and 5 degrees.
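To make the idea of a branching fully connected regressor concrete, here is a minimal sketch of a forward pass in plain Python. It is not the authors' implementation. The abstract only says the architecture branches, so the split into one position branch and one orientation branch, the unit-quaternion output, the layer sizes, the names (`build_model`, `regress_pose`), and the random stand-in for a pre-computed retrieval encoding are all assumptions made for illustration. The weights are random and untrained, so the predicted pose is meaningless; only the data flow is shown.

```python
import math
import random


def dense(x, weights, bias, relu=True):
    """Fully connected layer: y = W x + b, optionally followed by ReLU."""
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]
    return [max(0.0, o) for o in out] if relu else out


def init_layer(n_in, n_out, rng):
    """Random weights and a small constant bias; a real model would learn these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.1] * n_out


def build_model(enc_dim=16, hidden=8, seed=0):
    """Hypothetical layout: a shared layer, then two separate branches."""
    rng = random.Random(seed)
    return {
        "shared": init_layer(enc_dim, hidden, rng),
        "pos_hidden": init_layer(hidden, hidden, rng),
        "pos_out": init_layer(hidden, 3, rng),   # x, y, z (assumed output)
        "rot_hidden": init_layer(hidden, hidden, rng),
        "rot_out": init_layer(hidden, 4, rng),   # quaternion (assumed output)
    }


def regress_pose(model, encoding):
    """Map a fixed, pre-computed image encoding to (position, unit quaternion)."""
    h = dense(encoding, *model["shared"])
    position = dense(dense(h, *model["pos_hidden"]), *model["pos_out"], relu=False)
    q = dense(dense(h, *model["rot_hidden"]), *model["rot_out"], relu=False)
    norm = math.sqrt(sum(c * c for c in q)) or 1.0  # guard against a zero vector
    return position, [c / norm for c in q]


# Stand-in for an encoding produced by a generic image retrieval model.
_rng = random.Random(1)
encoding = [_rng.random() for _ in range(16)]
position, quaternion = regress_pose(build_model(), encoding)
print(position, quaternion)
```

The point of the sketch is that the image encoder sits outside the model: only the small fully connected layers would be trained per scene, while the encoding is computed once by a generic network.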

Yoli Shavit, Ron Ferens • 2020

Related benchmarks

Task	Dataset	Metric	Result
Camera Localization	7 Scenes	Average Position Error (m)	0.23	46
Camera Localization	7-Scenes Chess	Translation Error (m)	0.13	40
Visual Localization	Cambridge Landmarks (test)	Avg Median Positional Error (m)	1.42	35
Camera Pose Regression	7Scenes Fire	Median Position Error (m)	0.25	26
Camera Pose Regression	7Scenes Heads	Median Position Error (m)	0.15	26
Camera Pose Regression	7Scenes Pumpkin	Median Position Error (m)	0.22	26
Camera Pose Regression	7Scenes	Median Position Error (m)	0.23	26
Camera Pose Regression	7Scenes (Office)	Median Position Error (m)	0.24	26
Camera Pose Regression	7Scenes Stairs	Median Position Error (m)	0.34	26
Camera Pose Regression	7Scenes Kitchen	Median Position Error (m)	0.3	26

Showing 10 of 16 rows
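The benchmark rows report position errors in meters, and the abstract also quotes an orientation threshold in degrees. The page does not state the exact evaluation protocol, so the following is only a sketch of the common convention: position error as the Euclidean distance between estimated and ground-truth camera positions, orientation error as the angle between two quaternions, and a median taken over all query images. The function names are hypothetical.

```python
import math
import statistics


def position_error(p_est, p_gt):
    """Euclidean distance between estimated and ground-truth positions (meters)."""
    return math.dist(p_est, p_gt)


def orientation_error_deg(q_est, q_gt):
    """Angle in degrees between two orientations given as quaternions."""
    dot = abs(sum(a * b for a, b in zip(q_est, q_gt)))
    norms = math.sqrt(sum(a * a for a in q_est)) * math.sqrt(sum(b * b for b in q_gt))
    cos_half = min(1.0, dot / norms)  # clamp against floating-point overshoot
    return math.degrees(2.0 * math.acos(cos_half))


def median_errors(estimates, ground_truth):
    """Median position and orientation errors over a list of (position, quaternion) pairs."""
    pos = [position_error(e[0], g[0]) for e, g in zip(estimates, ground_truth)]
    rot = [orientation_error_deg(e[1], g[1]) for e, g in zip(estimates, ground_truth)]
    return statistics.median(pos), statistics.median(rot)
```

Taking the absolute value of the dot product treats `q` and `-q` as the same rotation, which is why the angle never exceeds 180 degrees.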
