Learning Monocular Depth in Dynamic Scenes via Instance-Aware Projection Consistency
About
We present an end-to-end joint training framework that explicitly models the 6-DoF motion of multiple dynamic objects, ego-motion, and depth in a monocular camera setup without supervision. Our technical contributions are three-fold. First, we highlight the fundamental difference between inverse and forward projection while modeling the individual motion of each rigid object, and propose a geometrically correct projection pipeline using a neural forward projection module. Second, we design a unified instance-aware photometric and geometric consistency loss that holistically imposes self-supervisory signals for every background and object region. Lastly, we introduce a general-purpose auto-annotation scheme that uses any off-the-shelf instance segmentation and optical flow models to produce video instance segmentation maps, which serve as input to our training pipeline. These proposed elements are validated in a detailed ablation study. Through extensive experiments conducted on the KITTI and Cityscapes datasets, our framework is shown to outperform state-of-the-art depth and motion estimation methods. Our code, dataset, and models are available at https://github.com/SeokjuLee/Insta-DM.
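To make the projection geometry mentioned in the abstract concrete, the following is a minimal sketch of how a single pixel with known depth is carried into another camera frame under a rigid 6-DoF motion, using a standard pinhole model. It is not the paper's neural forward projection module; the function name `reproject_pixel`, its parameters, and the example intrinsics are all hypothetical and chosen only for illustration.

```python
def reproject_pixel(u, v, depth, fx, fy, cx, cy, rotation, translation):
    """Carry pixel (u, v) with known depth into a second camera frame.

    Hypothetical helper for illustration: `rotation` is a 3x3 nested list,
    `translation` a length-3 list, and (fx, fy, cx, cy) are pinhole intrinsics.
    """
    # Back-project the pixel to a 3D point in the first camera's coordinates.
    point = (
        (u - cx) / fx * depth,
        (v - cy) / fy * depth,
        depth,
    )
    # Apply the rigid motion X' = R X + t.
    moved = [
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]
    # Project the moved point with the same intrinsics.
    u_new = fx * moved[0] / moved[2] + cx
    v_new = fy * moved[1] / moved[2] + cy
    return u_new, v_new, moved[2]


IDENTITY = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

# A point 10 units deep, moved 1 unit along x, seen with fx = 500.
shifted = reproject_pixel(320.0, 240.0, 10.0, 500.0, 500.0, 320.0, 240.0,
                          IDENTITY, [1.0, 0.0, 0.0])
print(shifted)
```

In this example the horizontal shift is fx * tx / depth = 500 * 1 / 10 = 50 pixels. A forward projection applies this mapping to every source pixel and scatters it into the target frame, whereas inverse warping evaluates the mapping at target pixels and samples the source image there; the sketch only shows the shared per-pixel geometry.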
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 0.112 | 502 |
| Monocular Depth Estimation | KITTI | Abs Rel | 0.112 | 161 |
| Monocular Depth Estimation | KITTI Raw Eigen (test) | RMSE | 4.547 | 159 |
| Monocular Depth Estimation | Cityscapes | Accuracy (delta < 1.25) | 86.8 | 62 |
| Depth Prediction | Cityscapes (test) | RMSE | 6.437 | 52 |
| Monocular Depth Estimation | KITTI Eigen (test) | AbsRel | 0.178 | 46 |
| Depth Prediction | KITTI original ground truth (test) | Abs Rel | 0.112 | 38 |
| Depth Prediction | KITTI original (Eigen split) | Abs Rel | 0.112 | 29 |
| Single-view depth estimation | KITTI 33 | AbsRel | 0.112 | 16 |
| Visual Odometry | KITTI Odometry raw (Sequence 09) | t_err (%) | 8.6 | 16 |
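
The depth metrics reported in the table above (Abs Rel, RMSE, and accuracy under delta < 1.25) follow the usual definitions in the monocular depth literature. The sketch below computes them for a small list of values; the function name `depth_metrics` and the sample numbers are hypothetical, and benchmark-specific details such as depth caps, crops, and median scaling are deliberately left out.

```python
import math


def depth_metrics(pred, gt):
    """Compute Abs Rel, RMSE, and delta < 1.25 accuracy.

    Hypothetical helper for illustration. Pixels without ground truth
    (gt <= 0) are skipped; predictions are assumed to be positive.
    """
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    n = len(pairs)
    # Mean absolute error relative to the ground-truth depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root mean squared error in depth units.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels whose depth ratio is within a factor of 1.25.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


# Made-up sample values; the last ground-truth entry is invalid and skipped.
metrics = depth_metrics([10.0, 22.0, 30.0, 5.0], [10.0, 20.0, 40.0, 0.0])
print(metrics)
```

Note that `delta1` here is a fraction in [0, 1], while the table lists the same quantity as a percentage.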