T2Net: Synthetic-to-Realistic Translation for Solving Single-Image Depth Estimation Tasks
About
Current methods for single-image depth estimation use training datasets with real image-depth pairs or stereo pairs, which are not easy to acquire. We propose a framework, trained on synthetic image-depth pairs and unpaired real images, that comprises an image translation network for enhancing realism of input images, followed by a depth prediction network. A key idea is having the first network act as a wide-spectrum input translator, taking in either synthetic or real images, and ideally producing minimally modified realistic images. This is done via a reconstruction loss when the training input is real, and GAN loss when synthetic, removing the need for heuristic self-regularization. The second network is trained on a task loss for synthetic image-depth pairs, with extra GAN loss to unify real and synthetic feature distributions. Importantly, the framework can be trained end-to-end, leading to good results, even surpassing early deep-learning methods that use real paired data.
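As a rough illustration of how the abstract's loss terms could be combined, the following minimal sketch routes each term by input domain: reconstruction loss for real inputs, image-level GAN loss for synthetic inputs, a depth task loss on synthetic pairs, and a feature-level GAN loss. All names here (`l1_loss`, `gan_generator_loss`, `total_loss`) and the weights are hypothetical. Using L1 for the reconstruction and task terms and a non-saturating log loss for the GAN terms is an assumption, not something the abstract specifies.

```python
import math

def l1_loss(a, b):
    """Mean absolute difference between two equally sized flat lists."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)

def gan_generator_loss(disc_scores):
    """Non-saturating generator loss, -mean(log D(x)), for scores in (0, 1]."""
    return -sum(math.log(s) for s in disc_scores) / len(disc_scores)

def total_loss(real_img, real_translated, syn_img_disc_scores,
               syn_depth_pred, syn_depth_gt, feat_disc_scores,
               w_rec=1.0, w_gan_img=1.0, w_task=1.0, w_gan_feat=1.0):
    """Combine the four terms named in the abstract (weights are assumed)."""
    # Real input: the translator should leave the image nearly unchanged.
    rec = l1_loss(real_translated, real_img)
    # Synthetic input: the translated image should look real to a discriminator.
    gan_img = gan_generator_loss(syn_img_disc_scores)
    # Depth supervision is available only for synthetic image-depth pairs.
    task = l1_loss(syn_depth_pred, syn_depth_gt)
    # Feature-level GAN term; which domain plays the "fake" role is assumed.
    gan_feat = gan_generator_loss(feat_disc_scores)
    total = (w_rec * rec + w_gan_img * gan_img
             + w_task * task + w_gan_feat * gan_feat)
    return {"rec": rec, "gan_img": gan_img, "task": task,
            "gan_feat": gan_feat, "total": total}
```

In an end-to-end setup, a single scalar like `total` would be backpropagated through both networks, while the discriminators are updated with their own losses (not shown).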
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Depth Estimation | NYU Depth V2 | RMSE | 0.738 | 177 |
| Monocular Depth Estimation | Make3D (test) | Abs Rel | 0.508 | 132 |
| Monocular Depth Estimation | KITTI 80m maximum depth (Eigen) | Abs Rel | 0.182 | 126 |
| Depth Prediction | NYU Depth V2 (test) | Accuracy (δ < 1.25) | 77.9 | 113 |
| Monocular Depth Estimation | KITTI 2015 (Eigen split) | Abs Rel | 0.114 | 95 |
| Depth Prediction | Cityscapes (test) | RMSE | 13.922 | 52 |
| Depth Estimation | KITTI 50m cap (test) | Abs Rel | 0.168 | 24 |
| Monocular Depth Estimation | KITTI Raw (KR) Eigen 80m (test) | Abs Rel Error | 0.174 | 20 |
| Monocular Depth Estimation | KITTI 50m cap Eigen split (test) | Absolute Relative Error | 0.148 | 19 |
| Monocular Depth Estimation | KITTI capped 50m 15 (Eigen) | Abs Rel | 0.168 | 19 |
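
The metrics reported above follow standard definitions in the depth estimation literature. The sketch below computes absolute relative error, RMSE, and the δ < 1.25 threshold accuracy over flat lists of depths. The function name `depth_metrics` and the `max_depth` parameter are hypothetical, and the exact masking and capping protocol behind each benchmark row may differ from this simplification. It assumes all predictions are strictly positive.

```python
import math

def depth_metrics(pred, gt, max_depth=None):
    """Compute Abs Rel, RMSE, and delta < 1.25 accuracy.

    Pixels with non-positive ground truth are skipped. If max_depth is
    given, pixels whose ground truth exceeds it are skipped as well
    (a simplified stand-in for a 50 m or 80 m evaluation cap).
    """
    pairs = [(p, g) for p, g in zip(pred, gt)
             if g > 0 and (max_depth is None or g <= max_depth)]
    n = len(pairs)
    if n == 0:
        raise ValueError("no valid ground-truth depths to evaluate")
    # Mean of |prediction - truth| / truth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Root of the mean squared error, in the same units as the depths.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Fraction of pixels where max(pred/gt, gt/pred) is below 1.25.
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}
```

Note that `delta1` is returned as a fraction in [0, 1], whereas the table lists the accuracy as a percentage.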