Learning the Depths of Moving People by Watching Frozen People

About

We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Because people are stationary, training data can be generated using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes to guide the depth prediction. We demonstrate our method on real-world sequences of complex human actions captured by a moving hand-held camera, show improvement over state-of-the-art monocular depth prediction methods, and show various 3D effects produced using our predicted depth.

Zhengqi Li, Tali Dekel, Forrester Cole, Richard Tucker, Noah Snavely, Ce Liu, William T. Freeman • 2019

Related benchmarks

Task	Dataset	Metric	Result	
Depth Prediction	ETH3D	AbsRel	18.1	37
Depth Prediction	Sintel	AbsRel	0.385	32
Monocular Depth Estimation	DIW	WHDR	23.15	19
Monocular Depth Estimation	TUM	Accuracy (delta <= 1.25)	29.54	14
Monocular Depth Estimation	NYU	Threshold Error (delta > 1.25)	18.57	9
Monocular Depth Estimation	KITTI	Error Rate (> 1.25)	36.29	9
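
For readers unfamiliar with the metrics in the table, the following is a minimal sketch of how AbsRel and the delta-threshold accuracy are commonly defined for depth prediction. It is not taken from the paper or from any benchmark's evaluation code: the function names are hypothetical, depths are passed as flat lists, and no scale alignment between prediction and ground truth is applied, although many benchmarks perform one first. Whether the threshold comparison is strict also varies between benchmarks; the sketch uses a strict comparison.

```python
def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt).

    Pixels with non-positive ground truth are treated as invalid and skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of valid pixels whose ratio max(pred/gt, gt/pred) is below threshold."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid pixels")
    hits = sum(1 for p, g in pairs if max(p / g, g / p) < threshold)
    return hits / len(pairs)


def delta_error_rate(pred, gt, threshold=1.25):
    """Complement of delta_accuracy: fraction of valid pixels at or above threshold."""
    return 1.0 - delta_accuracy(pred, gt, threshold)


if __name__ == "__main__":
    # Toy depths in arbitrary units (illustrative values, not benchmark data).
    predicted = [1.0, 2.2, 4.0, 3.0]
    ground_truth = [1.0, 2.0, 2.0, 4.0]
    print("AbsRel:", abs_rel(predicted, ground_truth))
    print("delta < 1.25 accuracy:", delta_accuracy(predicted, ground_truth))
    print("delta error rate:", delta_error_rate(predicted, ground_truth))
```

Lower is better for AbsRel and for the error-rate variants, while higher is better for the accuracy variant. Results are sometimes reported as fractions and sometimes as percentages, so values from different rows of the table are not directly comparable.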
