HRFormer: High-Resolution Transformer for Dense Prediction
About
We present the High-Resolution Transformer (HRFormer), which learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations at high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention that performs self-attention over small non-overlapping image windows, to improve memory and computational efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks; e.g., HRFormer outperforms the Swin Transformer by $1.3$ AP on COCO pose estimation with $50\%$ fewer parameters and $30\%$ fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.
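The local-window self-attention above operates on small non-overlapping windows, so each attention call only sees one window's tokens. A minimal NumPy sketch of the window-partition step (window size, tensor shapes, and the function name `window_partition` are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows.

    Returns an array of shape (num_windows, win * win, C); self-attention is
    then applied independently within each window.
    """
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    # (H, W, C) -> (H/win, win, W/win, win, C)
    x = x.reshape(H // win, win, W // win, win, C)
    # Group the two block axes together, then flatten each window's tokens.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

# Toy 4x4 feature map with a single channel holding the pixel index.
x = np.arange(4 * 4 * 1).reshape(4, 4, 1)
windows = window_partition(x, 2)
print(windows.shape)           # (4, 4, 1): four 2x2 windows of 4 tokens each
print(windows[0, :, 0])        # [0 1 4 5]: the top-left window's pixels
```

Because the windows never overlap, tokens in different windows cannot exchange information through attention alone; this is exactly the gap the convolution inserted into the FFN is meant to close.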
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 50 | 2731 |
| Semantic segmentation | ADE20K | -- | 936 |
| Image classification | ImageNet-1k (val) | Top-1 Acc 82.8 | 706 |
| Semantic segmentation | Cityscapes | -- | 578 |
| Human pose estimation | COCO (test-dev) | AP 76.2 | 408 |
| 2D human pose estimation | COCO 2017 (val) | AP 77.2 | 386 |
| Pose estimation | COCO (val) | AP 77.2 | 319 |
| Semantic segmentation | Cityscapes (val) | mIoU 83.2 | 287 |
| Semantic segmentation | COCO Stuff | -- | 195 |
| Human pose estimation | COCO 2017 (test-dev) | AP 76.2 | 180 |