Learning to Recover 3D Scene Shape from a Single Image
About
Despite significant progress in monocular depth estimation in the wild, recent state-of-the-art methods cannot be used to recover accurate 3D scene shape, because of an unknown depth shift induced by the shift-invariant reconstruction losses used in mixed-data depth prediction training, and a possibly unknown camera focal length. We investigate this problem in detail and propose a two-stage framework that first predicts depth up to an unknown scale and shift from a single monocular image, and then uses 3D point cloud encoders to predict the missing depth shift and focal length, which allows us to recover a realistic 3D scene shape. In addition, we propose an image-level normalized regression loss and a normal-based geometry loss to enhance depth prediction models trained on mixed datasets. We test our depth model on nine unseen datasets and achieve state-of-the-art performance on zero-shot dataset generalization. Code is available at: https://git.io/Depth
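To illustrate why the depth shift and focal length matter for 3D shape, the following is a minimal sketch of pinhole unprojection of a depth map into a point cloud. It is not the authors' implementation: the function name `unproject_depth`, the choice of principal point at the image center, and the square-pixel assumption are all hypothetical choices made for this example. In the paper, the shift and focal length are predicted by point cloud encoders; here they are simply passed in as arguments.

```python
import numpy as np


def unproject_depth(depth, focal_length, shift=0.0):
    """Back-project a depth map to an (H*W, 3) point cloud with a pinhole model.

    Assumptions for this sketch: square pixels, principal point at the
    image center, and focal_length given in pixels.
    """
    depth = np.asarray(depth, dtype=np.float64)
    h, w = depth.shape
    # Pixel coordinate grids: u runs along columns, v along rows.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    cx, cy = (w - 1) / 2.0, (h - 1) / 2.0
    # Correct the depth by the (otherwise unknown) shift before unprojecting.
    z = depth + shift
    x = (u - cx) * z / focal_length
    y = (v - cy) * z / focal_length
    return np.stack([x, y, z], axis=-1).reshape(-1, 3)


# Example: a flat 2x2 depth map, unprojected with an assumed shift and focal length.
depth = np.full((2, 2), 2.0)
points = unproject_depth(depth, focal_length=1.0, shift=1.0)
```

Because `x` and `y` scale with `z / focal_length`, an error in either the shift or the focal length changes the shape of the reconstructed point cloud, while a global scale error only resizes it. That is why a depth map known only up to scale and shift is not enough to recover scene shape.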
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 14.9 | 502 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 91.6 | 423 |
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 9 | 257 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.149 | 193 |
| Monocular Depth Estimation | KITTI | Abs Rel | 0.149 | 161 |
| Monocular Depth Estimation | ETH3D | AbsRel | 17.1 | 117 |
| Monocular Depth Estimation | NYU V2 | Delta 1 Acc | 91.6 | 113 |
| Monocular Depth Estimation | KITTI (test) | Abs Rel Error | 14.9 | 103 |
| Depth Estimation | ScanNet | AbsRel | 9.6 | 94 |
| Monocular Depth Estimation | DIODE | AbsRel | 27.1 | 93 |
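For readers unfamiliar with the metrics in the table, the sketch below shows the standard definitions of absolute relative error (AbsRel) and threshold accuracy (delta < 1.25). The function names are hypothetical, and the least-squares scale-and-shift alignment step is an assumption about how scale- and shift-ambiguous predictions are commonly compared with ground truth, not a confirmed description of the evaluation protocol behind each row. The table entries 14.9 and 0.149 appear to be the same kind of quantity expressed as a percentage and as a fraction.

```python
import numpy as np


def align_scale_shift(pred, gt):
    """Least-squares fit of scale and shift so that scale * pred + shift ~ gt."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    A = np.stack([pred, np.ones_like(pred)], axis=1)
    (scale, shift), *_ = np.linalg.lstsq(A, gt, rcond=None)
    return scale * pred + shift


def abs_rel(pred, gt):
    """Mean of |pred - gt| / gt over all pixels (gt must be positive)."""
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    return float(np.mean(np.abs(pred - gt) / gt))


def delta1(pred, gt, threshold=1.25):
    """Fraction of pixels with max(pred / gt, gt / pred) below the threshold.

    Both pred and gt are assumed to be strictly positive.
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    ratio = np.maximum(pred / gt, gt / pred)
    return float(np.mean(ratio < threshold))


# Example: a prediction that differs from ground truth only by scale and shift.
gt = np.array([1.0, 2.0, 4.0])
pred = (gt - 1.0) / 2.0
aligned = align_scale_shift(pred, gt)
```

After alignment, a prediction that is correct up to scale and shift scores perfectly on both metrics. Lower is better for AbsRel and higher is better for the delta accuracy.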