Attention meets Geometry: Geometry Guided Spatial-Temporal Attention for Consistent Self-Supervised Monocular Depth Estimation
About
Inferring geometrically consistent dense 3D scenes across a tuple of temporally consecutive images remains challenging for self-supervised monocular depth prediction pipelines. This paper explores how the increasingly popular transformer architecture, together with novel regularized loss formulations, can improve depth consistency while preserving accuracy. We propose a spatial attention module that correlates coarse depth predictions to aggregate local geometric information. A novel temporal attention mechanism further processes the local geometric information in a global context across consecutive images. Additionally, we introduce geometric constraints between frames regularized by photometric cycle consistency. By combining our proposed regularization and the novel spatial-temporal attention module, we fully leverage both the geometric and appearance-based consistency across monocular frames. This yields geometrically meaningful attention and improves temporal depth stability and accuracy compared to previous methods.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Depth Estimation | KITTI (Eigen split) | RMSE | 3.222 | 276 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.071 | 193 |
| Monocular Depth Estimation | DDAD (test) | RMSE | 15.121 | 122 |
| Monocular Depth Estimation | KITTI Improved GT (Eigen) | Abs Rel | 0.113 | 92 |
| Depth Estimation | DDAD (val) | Sq Rel | 3.788 | 31 |
| Video Depth Estimation | KITTI (Eigen split) | Delta1 Acc | 92.1 | 9 |
| Video Depth Estimation | KITTI | rTC | 0.901 | 9 |
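
For readers unfamiliar with the error measures in the table above, the following is a minimal sketch of how Abs Rel, Sq Rel, RMSE, and Delta1 accuracy are commonly defined for depth evaluation. The function name `depth_metrics` and the flat-sequence input format are illustrative assumptions, not the paper's evaluation code. The table does not say whether median scaling, depth caps, or crops were applied, so none are applied here. rTC is omitted because its definition is not given on this page.

```python
import math
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], gt: Sequence[float]) -> Dict[str, float]:
    """Common depth-evaluation metrics over pixels with valid ground truth.

    Hypothetical helper for illustration: `pred` and `gt` are flattened
    depth maps of equal length, in the same unit (e.g. metres).
    """
    # Keep only pixels where both ground truth and prediction are positive.
    pairs = [(p, g) for p, g in zip(pred, gt) if g > 0 and p > 0]
    if not pairs:
        raise ValueError("no valid pixels to evaluate")
    n = len(pairs)

    # Abs Rel: mean of |pred - gt| / gt
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    # Sq Rel: mean of (pred - gt)^2 / gt
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    # RMSE: square root of the mean squared error
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    # Delta1: fraction of pixels with max(pred/gt, gt/pred) < 1.25
    delta1 = sum(1 for p, g in pairs if max(p / g, g / p) < 1.25) / n

    return {"abs_rel": abs_rel, "sq_rel": sq_rel, "rmse": rmse, "delta1": delta1}


if __name__ == "__main__":
    # Toy example: the third pixel is overestimated by a factor of two.
    print(depth_metrics([1.0, 2.0, 4.0], [1.0, 2.0, 2.0]))
```

This sketch returns Delta1 as a fraction in [0, 1], whereas the table reports it as a percentage. Lower is better for Abs Rel, Sq Rel, and RMSE; higher is better for Delta1.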