Towards Zero-Shot Scale-Aware Monocular Depth Estimation
About
Monocular depth estimation is scale-ambiguous, and thus requires scale supervision to produce metric predictions. Even so, the resulting models will be geometry-specific, with learned scales that cannot be directly transferred across domains. Because of that, recent works focus instead on relative depth, eschewing scale in favor of improved up-to-scale zero-shot transfer. In this work we introduce ZeroDepth, a novel monocular depth estimation framework capable of predicting metric scale for arbitrary test images from different domains and camera parameters. This is achieved by (i) the use of input-level geometric embeddings that enable the network to learn a scale prior over objects; and (ii) decoupling the encoder and decoder stages via a variational latent representation that is conditioned on single-frame information. We evaluated ZeroDepth on both outdoor (KITTI, DDAD, nuScenes) and indoor (NYUv2) benchmarks, and achieved a new state of the art in both settings using the same pre-trained model, outperforming methods that train on in-domain data and require test-time scaling to produce metric estimates.
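
The "test-time scaling" that the abstract contrasts against can be illustrated with a minimal sketch. It assumes median scaling, a common way to align up-to-scale predictions with ground truth at evaluation time; the abstract does not say which scaling the compared methods use, and the function name `median_scale` is hypothetical. It is not part of ZeroDepth, which predicts metric depth directly and so does not need this step.

```python
from statistics import median


def median_scale(pred, gt):
    """Rescale up-to-scale depth predictions so their median matches the ground truth.

    `pred` and `gt` are flat sequences of depth values for the same pixels.
    Pixels with no ground truth (gt <= 0) are ignored when estimating the scale.
    Assumption: median scaling stands in for "test-time scaling" here.
    """
    valid = [(p, g) for p, g in zip(pred, gt) if g > 0]
    scale = median(g for _, g in valid) / median(p for p, _ in valid)
    return [p * scale for p in pred], scale


# Illustrative values only: the prediction has the right shape but the wrong scale.
pred = [1.0, 2.0, 4.0]
gt = [2.5, 5.0, 10.0]
scaled, scale = median_scale(pred, gt)
```

Because the scale factor is computed from ground-truth depth, a method evaluated this way is not producing metric estimates on its own, which is the distinction the abstract draws.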
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.081 | 257 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.102 | 193 |
| Monocular Depth Estimation | DDAD (test) | RMSE | 6.318 | 122 |
| Monocular Depth Estimation | KITTI (test) | Abs Rel Error | 0.064 | 103 |
| Monocular Depth Estimation | KITTI Eigen split (test) | AbsRel Mean | 10.2 | 94 |
| Metric Depth Estimation | KITTI in-domain (test) | Acc (δ < 1.25) | 96.8 | 27 |
| Monocular Depth Estimation | Diode Indoor (test) | A.Rel | 0.309 | 25 |
| Monocular Depth Estimation | KITTI official (val) | RMSE | 2.087 | 23 |
| Monocular Depth Estimation | Virtual KITTI 2 (test) | Delta 1 Acc | 90.5 | 22 |
| Monocular Depth Estimation | SUN-RGBD (test) | AbsRel | 0.121 | 22 |
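
The metrics named in the table (absolute relative error, RMSE, and the δ < 1.25 accuracy) have standard definitions, sketched below for reference. This is a minimal illustration assuming flat sequences of per-pixel depths; the function name `depth_metrics` is hypothetical, and the exact evaluation protocol behind each leaderboard entry (depth caps, crops, percentage versus fraction reporting) is not given in the source.

```python
import math


def depth_metrics(pred, gt):
    """Compute standard depth-estimation metrics over valid pixels.

    `pred` and `gt` are flat sequences of predicted and ground-truth depths.
    Pixels without a positive ground truth or prediction are skipped.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    n = len(pairs)
    # Abs Rel: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # RMSE: root of the mean squared error, in the units of the depth values
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Delta 1: fraction of pixels with max(pred/gt, gt/pred) < 1.25
    delta1 = sum(max(p / g, g / p) < 1.25 for p, g in pairs) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


# Illustrative values only, not taken from any benchmark.
metrics = depth_metrics([2.0, 4.0, 10.0], [2.0, 5.0, 8.0])
```

The function returns `delta1` as a fraction in [0, 1]; accuracy values such as 96.8 in the table appear to be the same quantity expressed as a percentage.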