Unsupervised Scale-consistent Depth Learning from Video
About
We propose a monocular depth estimator, SC-Depth, which requires only unlabelled videos for training and enables scale-consistent prediction at inference time. Our contributions include: (i) we propose a geometry consistency loss, which penalizes inconsistency between the depths predicted for adjacent views; (ii) we propose a self-discovered mask that automatically localizes moving objects, which violate the underlying static-scene assumption and cause noisy training signals; (iii) we demonstrate the efficacy of each component with a detailed ablation study and show high-quality depth estimation results on both the KITTI and NYUv2 datasets. Moreover, thanks to the scale-consistent prediction capability, our monocular-trained deep networks are readily integrated into the ORB-SLAM2 system for more robust and accurate tracking. The proposed hybrid Pseudo-RGBD SLAM shows compelling results on KITTI and generalizes well to the KAIST dataset without additional training. Finally, we provide several demos for qualitative evaluation.
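The geometry consistency loss and the self-discovered mask described in (i) and (ii) can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: it assumes `d_proj` is the depth map of one frame warped into an adjacent view and `d_interp` is the adjacent frame's own predicted depth sampled at the warped pixel coordinates (both hypothetical names); the normalized depth difference drives the loss, and its complement serves as the per-pixel mask that down-weights moving objects.

```python
import numpy as np

def geometry_consistency(d_proj, d_interp, eps=1e-7):
    """Sketch of a geometry consistency term between two depth maps.

    d_proj:   depth of frame A projected into frame B's view
    d_interp: frame B's predicted depth, sampled at the warped coordinates

    Returns the scalar consistency loss and a per-pixel weight mask.
    The normalized difference lies in [0, 1), so the mask does too.
    """
    # Normalized depth inconsistency; symmetric in the two depth maps.
    diff = np.abs(d_proj - d_interp) / (d_proj + d_interp + eps)
    loss = diff.mean()        # geometry consistency loss
    mask = 1.0 - diff         # self-discovered mask: near 0 where depths disagree
    return loss, mask

# Toy example: identical depths give zero loss and a mask of ones.
d = np.full((4, 4), 2.0)
loss, mask = geometry_consistency(d, d.copy())
```

In training, the mask would weight the photometric loss so that pixels on moving objects (where the two depths disagree after warping) contribute less.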
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 0.119 | 502 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 81.3 | 423 |
| Monocular Depth Estimation | KITTI | Abs Rel | 0.114 | 161 |
| Monocular Depth Estimation | KITTI Improved GT (Eigen) | Abs Rel | 0.119 | 92 |
| Depth Estimation | ScanNet (test) | Abs Rel | 0.169 | 65 |
| Single-view depth estimation | NYUv2 36 (test) | Abs Rel | 0.159 | 21 |
| Single-view depth estimation | NYU official 654 images v2 (test) | Abs Rel | 0.159 | 21 |
| Visual Odometry | KITTI Seq. 10 | Translational Error (%) | 3.82 | 20 |
| Visual Odometry | KITTI Seq. 09 | Translational Error (%) | 5.08 | 20 |
| Monocular Depth Estimation | DDAD | Abs Rel | 0.169 | 17 |