
Unsupervised Learning of Depth and Ego-Motion from Video

About

We present an unsupervised learning framework for monocular depth and camera motion estimation from unstructured video sequences. We achieve this by simultaneously training depth and camera pose estimation networks, using the task of view synthesis as the supervisory signal. The networks are thus coupled via the view synthesis objective during training, but can be applied independently at test time. Empirical evaluation on the KITTI dataset demonstrates the effectiveness of our approach: 1) monocular depth performs comparably with supervised methods that use either ground-truth pose or depth for training, and 2) pose estimation performs favorably compared to established SLAM systems under comparable input settings.
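The view-synthesis signal rests on projective warping: each target pixel is back-projected with the predicted depth, transformed by the predicted relative pose, and re-projected into the source view, where the source image is sampled and compared photometrically to the target. The following is a minimal NumPy sketch of that warping step under standard pinhole assumptions; the function name, argument shapes, and intrinsics handling are illustrative, not the authors' implementation.

```python
import numpy as np

def view_synthesis_warp(depth, pose, K):
    """Map target-view pixels into the source view using predicted depth
    and relative camera pose (the geometric core of view synthesis).

    depth: (H, W) predicted depth for the target frame
    pose:  (4, 4) target-to-source rigid transform [R|t]
    K:     (3, 3) camera intrinsics
    Returns (H, W, 2) source-view pixel coordinates (x, y).
    """
    H, W = depth.shape
    # Homogeneous pixel grid in the target view.
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # (3, H*W)
    # Back-project to 3-D camera coordinates: X = D * K^-1 * p.
    cam = (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)
    # Transform into the source frame and re-project: p_s ~ K * (T * X).
    cam_h = np.vstack([cam, np.ones((1, cam.shape[1]))])
    src = K @ (pose @ cam_h)[:3]
    # Perspective divide, then reshape back to image layout.
    return (src[:2] / src[2]).T.reshape(H, W, 2)
```

With an identity pose the warp is the identity map, as expected; in training, the returned coordinates would drive differentiable bilinear sampling of the source image, and the photometric error between the synthesized and target images supervises both networks.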

Tinghui Zhou, Matthew Brown, Noah Snavely, David G. Lowe · 2017

Related benchmarks

Task                       | Dataset                         | Metric                        | Result | Rank
Monocular Depth Estimation | KITTI (Eigen)                   | Abs Rel                       | 0.183  | 502
Depth Estimation           | NYU v2 (test)                   | Threshold Accuracy (δ < 1.25) | 67.4   | 423
Depth Estimation           | KITTI (Eigen split)             | RMSE                          | 6.709  | 276
Surface Normal Estimation  | NYU v2 (test)                   | Mean Angle Distance (MAD)     | 43.5   | 206
Monocular Depth Estimation | KITTI                           | Abs Rel                       | 0.208  | 161
Monocular Depth Estimation | KITTI Raw Eigen (test)          | RMSE                          | 4.975  | 159
Monocular Depth Estimation | Make3D (test)                   | Abs Rel                       | 0.383  | 132
Monocular Depth Estimation | KITTI 80m maximum depth (Eigen) | Abs Rel                       | 0.121  | 126
Monocular Depth Estimation | KITTI 2015 (Eigen split)        | Abs Rel                       | 0.183  | 95
Depth Estimation           | KITTI                           | Abs Rel                       | 0.13   | 92

(Showing 10 of 50 rows)
