
Vision Transformers for Dense Prediction

About

We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
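The step of assembling tokens into image-like representations can be illustrated with a small, dependency-free sketch. It is not taken from the released DPT code. The function name `tokens_to_feature_map`, the row-major patch ordering, the patch size, and the choice to simply drop one leading non-patch token are all assumptions made for illustration. Resampling the grid to different resolutions and the convolutional decoder are omitted.

```python
from typing import List, Sequence, Tuple

Token = Sequence[float]


def tokens_to_feature_map(
    tokens: Sequence[Token],
    image_hw: Tuple[int, int],
    patch_size: int,
    has_extra_token: bool = True,
) -> List[List[Token]]:
    """Arrange a flat token sequence into a rows x cols grid of feature vectors.

    Assumptions (hypothetical, not from the source): tokens are ordered
    row-major over image patches, and an optional leading non-patch token
    (for example a class token) is dropped.
    """
    height, width = image_hw
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of patch_size")
    rows, cols = height // patch_size, width // patch_size

    # Drop the leading non-patch token if present.
    patch_tokens = list(tokens[1:]) if has_extra_token else list(tokens)
    if len(patch_tokens) != rows * cols:
        raise ValueError(
            f"expected {rows * cols} patch tokens, got {len(patch_tokens)}"
        )

    # Slice the flat sequence into consecutive rows of `cols` tokens each.
    return [patch_tokens[r * cols:(r + 1) * cols] for r in range(rows)]


# Toy input: a 32x48 image with 16x16 patches gives a 2x3 grid of tokens.
toy_tokens = [[0.0, 0.0]] + [[float(i), float(-i)] for i in range(6)]
grid = tokens_to_feature_map(toy_tokens, image_hw=(32, 48), patch_size=16)
```

Because the token count depends only on the image and patch size, the same reshaping can be applied to tokens taken from any transformer stage, which is consistent with the constant-resolution backbone described above.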

René Ranftl, Alexey Bochkovskiy, Vladlen Koltun • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Semantic segmentation | ADE20K (val) | mIoU 49.02 | 2888 |
| Semantic segmentation | ADE20K | mIoU 49.2 | 1024 |
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel 0.062 | 523 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) 90.4 | 432 |
| Semantic segmentation | PASCAL Context (val) | mIoU 60.5 | 360 |
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel 0.094 | 300 |
| Depth Estimation | KITTI (Eigen split) | RMSE 2.573 | 291 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel 0.052 | 215 |
| Depth Estimation | NYU Depth V2 | -- | 209 |
| Monocular Depth Estimation | KITTI | Abs Rel 0.06 | 203 |
Showing 10 of 131 rows
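For readers unfamiliar with the depth metrics listed in the table, the following is a minimal sketch of their commonly used definitions: absolute relative error (Abs Rel), root mean squared error (RMSE), and threshold accuracy (delta < 1.25). The function name `depth_metrics` and the flat-list input format are hypothetical. Benchmark-specific details such as evaluation crops, depth caps, or scale alignment are not stated on this page and are not modeled here.

```python
import math
from typing import Dict, Sequence


def depth_metrics(pred: Sequence[float], target: Sequence[float]) -> Dict[str, float]:
    """Compute Abs Rel, RMSE, and delta < 1.25 over pixels with valid ground truth.

    Assumptions (for illustration only): depths are given as flat sequences,
    ground truth <= 0 marks an invalid pixel, and predictions are strictly
    positive.
    """
    pairs = [(p, t) for p, t in zip(pred, target) if t > 0]
    if not pairs:
        raise ValueError("no valid ground-truth pixels")
    n = len(pairs)

    # Abs Rel: mean of |pred - target| / target.
    abs_rel = sum(abs(p - t) / t for p, t in pairs) / n
    # RMSE: square root of the mean squared difference.
    rmse = math.sqrt(sum((p - t) ** 2 for p, t in pairs) / n)
    # delta < 1.25: fraction of pixels whose ratio max(p/t, t/p) is below 1.25.
    delta1 = sum(1 for p, t in pairs if max(p / t, t / p) < 1.25) / n

    return {"abs_rel": abs_rel, "rmse": rmse, "delta1": delta1}


# Toy example: two exact predictions and one that is off by a factor of two.
metrics = depth_metrics(pred=[1.0, 2.0, 4.0], target=[1.0, 2.0, 2.0])
```

Lower is better for Abs Rel and RMSE, and higher is better for threshold accuracy. The sketch returns threshold accuracy as a fraction between 0 and 1, whereas the table appears to report it as a percentage.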
