Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction
About
While convolutional neural networks have had a tremendous impact on various computer vision tasks, they generally exhibit limitations in explicitly modeling long-range dependencies due to the intrinsic locality of the convolution operation. Initially designed for natural language processing tasks, transformers have emerged as alternative architectures whose innate global self-attention mechanism captures long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To prevent the network from losing its ability to capture local-level details when adopting transformers, we propose a novel decoder that employs gate-based attention mechanisms. Notably, this is the first paper to apply transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-the-art performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/TransDepth.
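The gate-based fusion idea behind the decoder can be illustrated with a minimal sketch. This is a hypothetical simplification, not the paper's actual decoder: a learned sigmoid gate blends a global (transformer) feature map with a local (CNN) feature map per position, so the network can weigh long-range context against local detail. The function and weight names (`attention_gate`, `w_local`, `w_global`) are illustrative assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_gate(local_feat, global_feat, w_local, w_global):
    """Gated fusion of local (CNN) and global (transformer) features.

    local_feat, global_feat: arrays of shape (num_pixels, channels)
    w_local, w_global:       weight matrices of shape (channels, channels)

    The gate decides, per position and channel, how much global context
    to mix in; (1 - gate) keeps the complementary local detail.
    """
    gate = sigmoid(local_feat @ w_local + global_feat @ w_global)
    return gate * global_feat + (1.0 - gate) * local_feat

# Toy usage: 4 pixel positions, 8 channels, random features and weights.
rng = np.random.default_rng(0)
local = rng.standard_normal((4, 8))
glob = rng.standard_normal((4, 8))
w_l = rng.standard_normal((8, 8)) * 0.1
w_g = rng.standard_normal((8, 8)) * 0.1
fused = attention_gate(local, glob, w_l, w_g)
```

With zero weights the gate is exactly 0.5 everywhere, so the fusion reduces to a plain average of the two streams; training the weights lets the gate become input-dependent.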
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 0.061 | 502 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 90 | 423 |
| Depth Estimation | KITTI (Eigen split) | RMSE | 2.755 | 276 |
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.106 | 257 |
| Surface Normal Estimation | NYU v2 (test) | -- | -- | 206 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.064 | 193 |
| Depth Estimation | NYU Depth V2 | RMSE | 0.365 | 177 |
| Monocular Depth Estimation | KITTI Eigen split (test) | AbsRel Mean | 0.064 | 94 |
| Depth Estimation | KITTI | AbsRel | 0.064 | 92 |
| Monocular Depth Estimation | NYU-Depth v2 (official) | Abs Rel | 0.106 | 75 |