HRFormer: High-Resolution Transformer for Dense Prediction
About
We present the High-Resolution Transformer (HRFormer), which learns high-resolution representations for dense prediction tasks, in contrast to the original Vision Transformer, which produces low-resolution representations at high memory and computational cost. We take advantage of the multi-resolution parallel design introduced in high-resolution convolutional networks (HRNet), along with local-window self-attention that performs self-attention over small non-overlapping image windows, to improve memory and computational efficiency. In addition, we introduce a convolution into the FFN to exchange information across the disconnected image windows. We demonstrate the effectiveness of the High-Resolution Transformer on both human pose estimation and semantic segmentation tasks; e.g., HRFormer outperforms the Swin Transformer by $1.3$ AP on COCO pose estimation with $50\%$ fewer parameters and $30\%$ fewer FLOPs. Code is available at: https://github.com/HRNet/HRFormer.
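The local-window self-attention above operates on small non-overlapping windows, so each attention call only sees one window's tokens. A minimal NumPy sketch of the window-partition step (window size, tensor shapes, and the function name `window_partition` are illustrative assumptions, not taken from the paper's code):

```python
import numpy as np

def window_partition(x, win):
    """Split an (H, W, C) feature map into non-overlapping win x win windows.

    Returns an array of shape (num_windows, win * win, C); self-attention is
    then applied independently within each window.
    """
    H, W, C = x.shape
    assert H % win == 0 and W % win == 0, "H and W must be divisible by win"
    # (H, W, C) -> (H/win, win, W/win, win, C)
    x = x.reshape(H // win, win, W // win, win, C)
    # Group the two block axes together, then flatten each window's tokens.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, win * win, C)

# Toy 4x4 feature map with a single channel holding the pixel index.
x = np.arange(4 * 4 * 1).reshape(4, 4, 1)
windows = window_partition(x, 2)
print(windows.shape)           # (4, 4, 1): four 2x2 windows of 4 tokens each
print(windows[0, :, 0])        # [0 1 4 5]: the top-left window's pixels
```

Because the windows never overlap, tokens in different windows cannot exchange information through attention alone; this is exactly the gap the convolution inserted into the FFN is meant to close.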
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 50 | 2731 |
| Semantic segmentation | ADE20K | -- | 936 |
| Image classification | ImageNet-1k (val) | Top-1 Acc 82.8 | 706 |
| Semantic segmentation | Cityscapes | -- | 578 |
| Human pose estimation | COCO (test-dev) | AP 76.2 | 408 |
| 2D human pose estimation | COCO 2017 (val) | AP 77.2 | 386 |
| Pose estimation | COCO (val) | AP 77.2 | 319 |
| Semantic segmentation | Cityscapes (val) | mIoU 83.2 | 287 |
| Semantic segmentation | COCO Stuff | -- | 195 |
| Human pose estimation | COCO 2017 (test-dev) | AP 76.2 | 180 |