Learning the Depths of Moving People by Watching Frozen People
About
We present a method for predicting dense depth in scenarios where both a monocular camera and people in the scene are freely moving. Existing methods for recovering depth for dynamic, non-rigid objects from monocular video impose strong assumptions on the objects' motion and may only recover sparse depth. In this paper, we take a data-driven approach and learn human depth priors from a new source of data: thousands of Internet videos of people imitating mannequins, i.e., freezing in diverse, natural poses, while a hand-held camera tours the scene. Because people are stationary, training data can be generated using multi-view stereo reconstruction. At inference time, our method uses motion parallax cues from the static areas of the scenes to guide the depth prediction. We demonstrate our method on real-world sequences of complex human actions captured by a moving hand-held camera, show improvement over state-of-the-art monocular depth prediction methods, and show various 3D effects produced using our predicted depth.
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Prediction | ETH3D | AbsRel | 18.1 | 35 |
| Depth Prediction | Sintel | AbsRel | 0.385 | 32 |
| Monocular Depth Estimation | DIW | WHDR | 23.15 | 19 |
| Monocular Depth Estimation | TUM | Accuracy (delta <= 1.25) | 29.54 | 9 |
| Monocular Depth Estimation | NYU | Threshold Error (delta > 1.25) | 18.57 | 9 |
| Monocular Depth Estimation | KITTI | Error Rate (> 1.25) | 36.29 | 9 |
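
For readers unfamiliar with the metrics in the table above, the following is a minimal sketch of how AbsRel and the delta-threshold metrics are commonly defined for depth prediction. It is not taken from the paper or from any benchmark's evaluation code: the function names `abs_rel` and `delta_accuracy` are hypothetical, treating non-positive depths as invalid is an assumption, and individual benchmarks may apply scale alignment or report percentages rather than fractions.

```python
import numpy as np


def _valid_pairs(pred, gt):
    """Return predicted and ground-truth depths at pixels where both are positive.

    Assumption: non-positive values mark missing or invalid depth.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = (gt > 0) & (pred > 0)
    return pred[valid], gt[valid]


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    p, g = _valid_pairs(pred, gt)
    return float(np.mean(np.abs(p - g) / g))


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred/gt, gt/pred) <= threshold."""
    p, g = _valid_pairs(pred, gt)
    ratio = np.maximum(p / g, g / p)
    return float(np.mean(ratio <= threshold))


# Toy data: the last pixel has no ground truth and is ignored.
gt = np.array([1.0, 2.0, 4.0, 0.0])
pred = np.array([1.1, 2.0, 6.0, 3.0])

print("AbsRel:", abs_rel(pred, gt))
print("Accuracy (delta <= 1.25):", delta_accuracy(pred, gt))
print("Error (delta > 1.25):", 1.0 - delta_accuracy(pred, gt))
```

Under these definitions the "Threshold Error" and "Error Rate" rows correspond to one minus the delta accuracy, so lower is better for those rows and for AbsRel, while higher is better for the accuracy row.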