Boosting 3-DoF Ground-to-Satellite Camera Localization Accuracy via Geometry-Guided Cross-View Transformer

About

Image retrieval-based cross-view localization methods often lead to very coarse camera pose estimation, due to the limited sampling density of the database satellite images. In this paper, we propose a method to increase the accuracy of a ground camera's location and orientation by estimating the relative rotation and translation between the ground-level image and its matched/retrieved satellite image. Our approach designs a geometry-guided cross-view transformer that combines the benefits of conventional geometry and learnable cross-view transformers to map the ground-view observations to an overhead view. Given the synthesized overhead view and observed satellite feature maps, we construct a neural pose optimizer with strong global information embedding ability to estimate the relative rotation between them. After aligning their rotations, we develop an uncertainty-guided spatial correlation to generate a probability map of the vehicle locations, from which the relative translation can be determined. Experimental results demonstrate that our method significantly outperforms the state-of-the-art. Notably, the likelihood of restricting the vehicle lateral pose to be within 1m of its Ground Truth (GT) value on the cross-view KITTI dataset has been improved from $35.54\%$ to $76.44\%$, and the likelihood of restricting the vehicle orientation to be within $1^{\circ}$ of its GT value has been improved from $19.64\%$ to $99.10\%$.
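The translation step in the abstract (spatial correlation producing a probability map over vehicle locations) can be illustrated with a minimal sketch. This is not the authors' implementation: the single-channel feature maps, the function names `correlation_probability_map` and `most_likely_offset`, and the optional per-cell `weight` that stands in for the uncertainty guidance are all assumptions made for illustration.

```python
import math

def correlation_probability_map(sat, ground, weight=None):
    """Slide `ground` (h x w) over `sat` (H x W) and return a probability map.

    `weight` is a hypothetical per-cell confidence for the ground template;
    it is an assumed stand-in for the paper's uncertainty guidance.
    """
    H, W = len(sat), len(sat[0])
    h, w = len(ground), len(ground[0])
    if weight is None:
        weight = [[1.0] * w for _ in range(h)]

    # Weighted correlation score for every valid (dy, dx) offset.
    scores = []
    for dy in range(H - h + 1):
        row = []
        for dx in range(W - w + 1):
            s = 0.0
            for i in range(h):
                for j in range(w):
                    s += weight[i][j] * sat[dy + i][dx + j] * ground[i][j]
            row.append(s)
        scores.append(row)

    # Softmax over all offsets; subtracting the max keeps exp() stable.
    m = max(max(r) for r in scores)
    exps = [[math.exp(v - m) for v in r] for r in scores]
    z = sum(sum(r) for r in exps)
    return [[v / z for v in r] for r in exps]

def most_likely_offset(prob):
    """Return the (dy, dx) offset with the highest probability."""
    best = max((p, dy, dx) for dy, r in enumerate(prob) for dx, p in enumerate(r))
    return best[1], best[2]

# Toy example: a 2 x 2 pattern placed at row 1, column 2 of a 4 x 4 map.
sat = [[0, 0, 0, 0],
       [0, 0, 1, 2],
       [0, 0, 3, 4],
       [0, 0, 0, 0]]
ground = [[1, 2],
          [3, 4]]
prob = correlation_probability_map(sat, ground)
offset = most_likely_offset(prob)
```

Plain correlation favours high-magnitude regions, so a practical system would normalise the features or learn the weighting. The sketch only shows how a score map becomes a distribution from which the translation is read off.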

Yujiao Shi, Fei Wu, Akhil Perincherry, Ankit Vora, Hongdong Li • 2023

Related benchmarks

Task	Dataset	Result
Location and orientation estimation	VIGOR (Same-Area)	Location Mean Error (m): 4.12	42
Location and orientation estimation	VIGOR (Cross-Area)	Location Mean Error (m): 5.16	39
Position and Orientation Estimation	KITTI Cross-area	Position Lateral Recall R@1m (%): 57.74	23
Cross-View Geolocalization	KITTI Same-Area (test)	Lateral Recall @ 1m: 76.44	14
Cross-view Localization	KITTI Cross-Area (test)	Lateral Recall @1m (%): 57.72	11
Cross-view pose estimation	KITTI Same-area	Location Mean (m): 7.87	10
Cross-view yaw estimation	MGL	Accuracy (< 1°): 7.4	10
Camera pose estimation	Ford multi-AV (Log2)	Lateral Success Rate @ 1m (%): 67.96	9
Camera pose estimation	Ford multi-AV (Log1)	Lateral Success Rate @ 1m: 67.57	9
Location Estimation	Oxford RobotCar (test1)	Mean Position Error (m): 2.4	8

Showing 10 of 18 rows
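
Several rows above report lateral recall within 1 m or orientation accuracy within 1°. The sketch below shows one plausible way such numbers are computed. It assumes that "lateral" means the error component perpendicular to the vehicle heading and that the threshold is inclusive. The function names and the sample errors are hypothetical and do not come from the benchmarks.

```python
import math

def lateral_longitudinal_error(pred_xy, gt_xy, heading_deg):
    """Split a 2-D position error into parts across and along the heading.

    Assumes heading_deg is measured from the x-axis, counter-clockwise.
    """
    ex = pred_xy[0] - gt_xy[0]
    ey = pred_xy[1] - gt_xy[1]
    th = math.radians(heading_deg)
    fx, fy = math.cos(th), math.sin(th)   # unit vector along the heading
    longitudinal = ex * fx + ey * fy      # projection onto the heading
    lateral = -ex * fy + ey * fx          # projection onto the left normal
    return abs(lateral), abs(longitudinal)

def angle_error_deg(pred_deg, gt_deg):
    """Smallest absolute difference between two angles, in degrees."""
    d = (pred_deg - gt_deg) % 360.0
    return min(d, 360.0 - d)

def recall_within(errors, threshold):
    """Percentage of errors that are at most `threshold` (inclusive, assumed)."""
    return 100.0 * sum(e <= threshold for e in errors) / len(errors)

# Hypothetical lateral errors in metres for four test frames.
lateral_errors = [0.5, 2.0, 0.9, 1.5]
recall_1m = recall_within(lateral_errors, 1.0)
```

Splitting the error by heading separates the cross-track and along-track components, which is why benchmarks can report lateral recall on its own.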
