Unsupervised High-Resolution Depth Learning From Videos With Dual Networks
About
Unsupervised depth learning takes the appearance difference between a target view and a view synthesized from its adjacent frame as the supervisory signal. Since the supervisory signal comes only from the images themselves, the resolution of the training data significantly impacts performance. High-resolution images contain more fine-grained details and provide a more accurate supervisory signal. However, due to limits on memory and computation power, the original images are typically down-sampled during training, which causes a heavy loss of detail and disparity accuracy. To fully exploit the information contained in high-resolution data, we propose a simple yet effective dual-network architecture, which can directly take high-resolution images as input and efficiently generate high-resolution, high-accuracy depth maps. We also propose a Self-assembled Attention (SA-Attention) module to handle low-texture regions. Evaluation on the KITTI and Make3D benchmark datasets demonstrates that our method achieves state-of-the-art results in the monocular depth estimation task.
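To make the supervisory signal concrete, the sketch below computes a mean absolute intensity difference between a target view and a synthesized view. It is only an illustration under stated assumptions: the function name `photometric_l1`, the nested-list grayscale image format, and the choice of a plain L1 difference are all hypothetical, and the abstract does not specify the exact loss the paper uses. The view synthesis step itself (warping the adjacent frame using predicted depth and camera motion) is not shown.

```python
def photometric_l1(target, synthesized):
    """Mean absolute intensity difference between two same-sized grayscale images.

    Images are lists of rows of floats. This format and the plain L1
    difference are illustrative assumptions, not the paper's loss.
    """
    if len(target) != len(synthesized):
        raise ValueError("images must have the same height")
    total, count = 0.0, 0
    for row_t, row_s in zip(target, synthesized):
        if len(row_t) != len(row_s):
            raise ValueError("images must have the same width")
        for t, s in zip(row_t, row_s):
            total += abs(t - s)
            count += 1
    if count == 0:
        raise ValueError("images must not be empty")
    return total / count


# Toy 2x2 example: the synthesized view differs from the target in two pixels.
target = [[0.2, 0.4], [0.6, 0.8]]
synth = [[0.2, 0.5], [0.5, 0.8]]
loss = photometric_l1(target, synth)
print(loss)
```

A lower value means the synthesized view matches the target more closely, which is what training pushes the depth prediction toward. Because the signal is computed per pixel, down-sampling the images removes fine detail from exactly the quantity being minimized.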
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 0.121 | 502 |
| Monocular Depth Estimation | KITTI Raw Eigen (test) | RMSE | 4.945 | 159 |
| Monocular Depth Estimation | KITTI 2015 (Eigen split) | Abs Rel | 0.121 | 95 |
| Depth Prediction | KITTI original ground truth (test) | Abs Rel | 0.121 | 38 |
| Monocular Depth Estimation | Make3D C1 metrics up to 70m (test134) | Abs Rel | 0.318 | 12 |
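The Abs Rel and RMSE figures in the table are standard depth-error metrics: Abs Rel is the mean of |predicted − true| / true, and RMSE is the square root of the mean squared difference. The sketch below shows how they might be computed over flat lists of depths. The function names and the convention that a ground-truth value of 0 marks a missing measurement are assumptions made for illustration; benchmark protocols typically also apply crops, depth caps, and scale alignment, which are omitted here.

```python
import math


def _valid_pairs(pred, gt):
    """Keep only pixels with a ground-truth depth (assumed: gt > 0 means valid)."""
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0]
    if not pairs:
        raise ValueError("no valid ground-truth depths")
    return pairs


def abs_rel(pred, gt):
    """Absolute relative error: mean of |pred - gt| / gt over valid pixels."""
    pairs = _valid_pairs(pred, gt)
    return sum(abs(p - g) / g for p, g in pairs) / len(pairs)


def rmse(pred, gt):
    """Root mean squared error over valid pixels, in the units of the depths."""
    pairs = _valid_pairs(pred, gt)
    return math.sqrt(sum((p - g) ** 2 for p, g in pairs) / len(pairs))


# Toy example: the third pixel has no ground truth and is ignored.
gt_depth = [10.0, 20.0, 0.0, 40.0]
pred_depth = [11.0, 18.0, 5.0, 40.0]
a = abs_rel(pred_depth, gt_depth)
r = rmse(pred_depth, gt_depth)
print(a, r)
```

Both metrics are errors, so lower is better. Abs Rel normalizes by the true depth and therefore weights near and far pixels more evenly, whereas RMSE is dominated by large absolute errors, which tend to occur at long range.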