Vision Transformers for Dense Prediction
About
We introduce dense vision transformers, an architecture that leverages vision transformers in place of convolutional networks as a backbone for dense prediction tasks. We assemble tokens from various stages of the vision transformer into image-like representations at various resolutions and progressively combine them into full-resolution predictions using a convolutional decoder. The transformer backbone processes representations at a constant and relatively high resolution and has a global receptive field at every stage. These properties allow the dense vision transformer to provide finer-grained and more globally coherent predictions when compared to fully-convolutional networks. Our experiments show that this architecture yields substantial improvements on dense prediction tasks, especially when a large amount of training data is available. For monocular depth estimation, we observe an improvement of up to 28% in relative performance when compared to a state-of-the-art fully-convolutional network. When applied to semantic segmentation, dense vision transformers set a new state of the art on ADE20K with 49.02% mIoU. We further show that the architecture can be fine-tuned on smaller datasets such as NYUv2, KITTI, and Pascal Context where it also sets the new state of the art. Our models are available at https://github.com/intel-isl/DPT.
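The token-assembly step described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the function names, the 16-pixel patch size, the presence of a leading readout (class) token that is simply dropped, and the use of nearest-neighbour and strided resampling are all assumptions made for illustration. The released models may use learned projections and convolutional resampling instead.

```python
def tokens_to_grid(tokens, grid_h, grid_w, has_readout=True):
    """Reshape a flat token sequence into a grid_h x grid_w grid of feature vectors."""
    if has_readout:
        tokens = tokens[1:]  # assumption: the first token is a readout/class token
    if len(tokens) != grid_h * grid_w:
        raise ValueError("token count does not match the requested grid size")
    return [tokens[r * grid_w:(r + 1) * grid_w] for r in range(grid_h)]


def upsample_nearest(grid, factor):
    """Enlarge the grid by an integer factor using nearest-neighbour repetition."""
    out = []
    for row in grid:
        wide = [vec for vec in row for _ in range(factor)]
        out.extend(list(wide) for _ in range(factor))
    return out


def downsample_stride(grid, stride):
    """Shrink the grid by keeping every `stride`-th row and column."""
    return [row[::stride] for row in grid[::stride]]


# Hypothetical input: a 64x96 image cut into 16x16 patches gives a 4x6 token grid.
patch, height, width, dim = 16, 64, 96, 3
grid_h, grid_w = height // patch, width // patch
tokens = [[0.0] * dim] + [[float(i)] * dim for i in range(grid_h * grid_w)]

grid = tokens_to_grid(tokens, grid_h, grid_w)  # 4 x 6 image-like representation
fine = upsample_nearest(grid, 4)               # 16 x 24, a higher-resolution map
coarse = downsample_stride(grid, 2)            # 2 x 3, a lower-resolution map
```

Applying the same reshaping to tokens taken from several transformer stages, each with a different resampling factor, would yield the multi-resolution feature maps that a convolutional decoder can then fuse.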
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 49.02 | 2731 |
| Semantic segmentation | ADE20K | mIoU | 49.2 | 936 |
| Monocular Depth Estimation | KITTI (Eigen) | Abs Rel | 0.062 | 502 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 90.4 | 423 |
| Semantic segmentation | PASCAL Context (val) | mIoU | 60.5 | 323 |
| Depth Estimation | KITTI (Eigen split) | RMSE | 2.573 | 276 |
| Monocular Depth Estimation | NYU v2 (test) | Abs Rel | 0.094 | 257 |
| Monocular Depth Estimation | KITTI (Eigen split) | Abs Rel | 0.052 | 193 |
| Depth Estimation | NYU Depth V2 | -- | -- | 177 |
| Semantic segmentation | Pascal Context (test) | mIoU | 60.46 | 176 |
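
For readers unfamiliar with the metrics named in the table, the sketch below gives their standard definitions in plain Python. The function names and the toy inputs are hypothetical, and real benchmark protocols usually add details not shown here, such as valid-pixel masks, depth caps, and dataset-specific crops. The functions return fractions, whereas the table appears to report mIoU and threshold accuracy as percentages.

```python
import math


def abs_rel(pred, gt):
    """Mean absolute relative error: mean(|pred - gt| / gt)."""
    return sum(abs(p - g) / g for p, g in zip(pred, gt)) / len(gt)


def rmse(pred, gt):
    """Root mean squared error between predicted and true depth."""
    return math.sqrt(sum((p - g) ** 2 for p, g in zip(pred, gt)) / len(gt))


def delta_accuracy(pred, gt, threshold=1.25):
    """Fraction of pixels where max(pred/gt, gt/pred) is below the threshold."""
    hits = sum(1 for p, g in zip(pred, gt) if max(p / g, g / p) < threshold)
    return hits / len(gt)


def mean_iou(pred_labels, gt_labels, num_classes):
    """Mean intersection-over-union over classes that occur in pred or gt."""
    ious = []
    for c in range(num_classes):
        pairs = list(zip(pred_labels, gt_labels))
        inter = sum(1 for p, g in pairs if p == c and g == c)
        union = sum(1 for p, g in pairs if p == c or g == c)
        if union:  # skip classes absent from both prediction and ground truth
            ious.append(inter / union)
    return sum(ious) / len(ious)


# Toy values, invented purely to exercise the functions.
depth_pred, depth_gt = [2.0, 4.0], [2.0, 5.0]
seg_pred, seg_gt = [0, 0, 1, 1], [0, 1, 1, 1]

depth_abs_rel = abs_rel(depth_pred, depth_gt)
depth_rmse = rmse(depth_pred, depth_gt)
depth_delta = delta_accuracy(depth_pred, depth_gt)
seg_miou = mean_iou(seg_pred, seg_gt, num_classes=2)
```

Lower is better for Abs Rel and RMSE, while higher is better for threshold accuracy and mIoU.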