
Transformer-Based Attention Networks for Continuous Pixel-Wise Prediction

About

While convolutional neural networks have had a tremendous impact on various computer vision tasks, they generally struggle to explicitly model long-range dependencies due to the intrinsic locality of the convolution operation. Transformers, initially designed for natural language processing tasks, have emerged as alternative architectures with an innate global self-attention mechanism that captures long-range dependencies. In this paper, we propose TransDepth, an architecture that benefits from both convolutional neural networks and transformers. To avoid the network losing its ability to capture local-level details due to the adoption of transformers, we propose a novel decoder that employs gate-based attention mechanisms. Notably, this is the first paper that applies transformers to pixel-wise prediction problems involving continuous labels (i.e., monocular depth prediction and surface normal estimation). Extensive experiments demonstrate that the proposed TransDepth achieves state-of-the-art performance on three challenging datasets. Our code is available at: https://github.com/ygjwd12345/TransDepth.
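The abstract's contrast between local convolutions and global self-attention can be made concrete with a minimal sketch. The following is not the paper's implementation; it is a bare scaled dot-product self-attention over a set of tokens (e.g. flattened feature-map positions) in NumPy, with identity Q/K/V projections and a single head as simplifying assumptions:

```python
import numpy as np

def self_attention(x):
    """Minimal scaled dot-product self-attention (sketch).

    x: (n, d) array -- n tokens (e.g. feature-map positions), d channels.
    Q/K/V projections are omitted (identity) to keep the sketch minimal.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                 # (n, n): every token scored against every other
    scores -= scores.max(axis=-1, keepdims=True)  # subtract row max for softmax stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ x                            # each output mixes all positions globally

# Contrast with convolution: a 3x3 kernel mixes only a local window,
# whereas the (n, n) weight matrix above mixes every position with
# every other position in a single step.
```

The (n, n) attention matrix is exactly what the abstract means by "innate global self-attention": long-range dependencies are modeled in one layer instead of being built up through stacked local receptive fields.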

Guanglei Yang, Hao Tang, Mingli Ding, Nicu Sebe, Elisa Ricci • 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 0.061 | 502
Depth Estimation | NYU v2 (test) | Threshold Accuracy (δ < 1.25) | 90 | 423
Depth Estimation | KITTI (Eigen split) | RMSE | 2.755 | 276
Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.106 | 257
Surface Normal Estimation | NYU v2 (test) | -- | -- | 206
Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.064 | 193
Depth Estimation | NYU Depth V2 | RMSE | 0.365 | 177
Monocular Depth Estimation | KITTI Eigen split (test) | AbsRel Mean | 0.064 | 94
Depth Estimation | KITTI | AbsRel | 0.064 | 92
Monocular Depth Estimation | NYU-Depth v2 (official) | Abs Rel | 0.106 | 75

Showing 10 of 16 rows.

Other info

Code: https://github.com/ygjwd12345/TransDepth
