Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction
About
While convolutional neural networks have had a tremendous impact on various computer vision tasks, they generally show limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, Transformers have emerged as alternative architectures with innate global self-attention mechanisms to capture long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To prevent the network from losing its ability to capture local-level details due to the adoption of transformers, we propose a novel decoder that employs gate-based attention mechanisms. Notably, this is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-the-art performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/TransDepth.
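The gating idea in the decoder can be illustrated with a small sketch. The block below is a generic attention gate in PyTorch, with hypothetical names and channel sizes (`AttentionGate`, `skip_ch`, `dec_ch`, `mid_ch`); it is not the paper's exact decoder (see the repository above for that), but it shows the general pattern: a coarse decoder feature produces a per-pixel gate that modulates a high-resolution skip feature, letting the network retain local detail.

```python
# A minimal sketch (not the exact TransDepth module) of a gated attention
# block that fuses a coarse decoder feature with an encoder skip connection.
import torch
import torch.nn as nn


class AttentionGate(nn.Module):
    """Gates an encoder skip feature with a coarse decoder feature.

    Hypothetical channel sizes; the paper's decoder differs in detail.
    """

    def __init__(self, skip_ch: int, dec_ch: int, mid_ch: int):
        super().__init__()
        self.proj_skip = nn.Conv2d(skip_ch, mid_ch, kernel_size=1)
        self.proj_dec = nn.Conv2d(dec_ch, mid_ch, kernel_size=1)
        self.gate = nn.Sequential(
            nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, 1, kernel_size=1),
            nn.Sigmoid(),  # per-pixel gate in [0, 1]
        )

    def forward(self, skip: torch.Tensor, dec: torch.Tensor) -> torch.Tensor:
        # Upsample the coarse decoder feature to the skip resolution.
        dec = nn.functional.interpolate(
            dec, size=skip.shape[-2:], mode="bilinear", align_corners=False
        )
        attn = self.gate(self.proj_skip(skip) + self.proj_dec(dec))
        return skip * attn  # pass or suppress local detail per pixel


if __name__ == "__main__":
    gate = AttentionGate(skip_ch=64, dec_ch=256, mid_ch=32)
    skip = torch.randn(1, 64, 120, 160)  # high-resolution encoder feature
    dec = torch.randn(1, 256, 30, 40)    # coarse decoder feature
    print(gate(skip, dec).shape)         # torch.Size([1, 64, 120, 160])
```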
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 0.061 | 523 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 90 | 432 |
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.106 | 300 |
| Depth Estimation | KITTI (Eigen split) | RMSE | 2.755 | 291 |
| Surface Normal Estimation | NYU v2 (test) | -- | -- | 224 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.064 | 215 |
| Depth Estimation | NYU Depth V2 | RMSE | 0.365 | 209 |
| Depth Estimation | KITTI | AbsRel | 0.064 | 106 |
| Monocular Depth Estimation | KITTI Eigen split (test) | AbsRel Mean | 0.064 | 100 |
| Monocular Depth Estimation | NYU-Depth v2 (official) | Abs Rel | 0.106 | 75 |
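For reference, the metrics in the table follow the standard monocular-depth evaluation protocol. The snippet below is a minimal NumPy sketch of their usual definitions (Abs Rel, RMSE, and threshold accuracy delta < 1.25); the helper name `depth_metrics` and the toy inputs are illustrative, and the official leaderboard scripts may differ in masking and cropping details.

```python
# Standard monocular-depth metrics as commonly defined for the KITTI/NYU
# benchmarks above; a minimal NumPy sketch, not the official evaluation code.
import numpy as np


def depth_metrics(pred: np.ndarray, gt: np.ndarray) -> dict:
    """Abs Rel, RMSE, and threshold accuracy (delta < 1.25) over valid pixels."""
    valid = gt > 0                      # ignore pixels without ground truth
    pred, gt = pred[valid], gt[valid]
    abs_rel = np.mean(np.abs(pred - gt) / gt)
    rmse = np.sqrt(np.mean((pred - gt) ** 2))
    ratio = np.maximum(pred / gt, gt / pred)
    delta1 = np.mean(ratio < 1.25)      # fraction of pixels within 25% of GT
    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


# Example: a prediction that is 5% too deep everywhere.
gt = np.full((480, 640), 10.0)
pred = gt * 1.05
print(depth_metrics(pred, gt))  # abs_rel=0.05, rmse=0.5, delta1=1.0
```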