ResT: An Efficient Transformer for Visual Recognition
About
This paper presents an efficient multi-scale vision Transformer, called ResT, that capably serves as a general-purpose backbone for image recognition. Unlike existing Transformer methods, which employ standard Transformer blocks to tackle raw images with a fixed resolution, ResT has several advantages: (1) A memory-efficient multi-head self-attention is built, which compresses the memory with a simple depth-wise convolution and projects the interaction across the attention-head dimension while preserving the diversity of the heads; (2) Position encoding is constructed as spatial attention, which is more flexible and can handle input images of arbitrary size without interpolation or fine-tuning; (3) Instead of straightforward tokenization at the beginning of each stage, we design the patch embedding as a stack of overlapping, strided convolution operations on the 2D-reshaped token map. We comprehensively validate ResT on image classification and downstream tasks. Experimental results show that the proposed ResT can outperform recent state-of-the-art backbones by a large margin, demonstrating the potential of ResT as a strong backbone. The code and models will be made publicly available at https://github.com/wofmanaf/ResT.
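To make points (1) and (3) concrete, the following back-of-the-envelope sketch counts the entries in one head's attention map with and without spatially compressed keys and values, after a single overlapping strided convolution. It is an illustration under stated assumptions, not the authors' implementation: the function names, the input size of 224, the kernel, stride, and padding values, and the reduction factor of 8 are all hypothetical choices made for this example.

```python
def conv_output_size(size: int, kernel: int, stride: int, padding: int) -> int:
    """Standard output length of a strided convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


def attention_score_entries(height: int, width: int, reduction: int = 1) -> int:
    """Entries in one head's attention map when the key/value token map is
    downsampled by `reduction` along each spatial axis (1 means no compression)."""
    queries = height * width
    keys = (height // reduction) * (width // reduction)
    return queries * keys


# Hypothetical settings, chosen only for illustration (not taken from the paper).
# kernel > stride, so neighbouring patches overlap.
side = conv_output_size(224, kernel=7, stride=4, padding=3)
full = attention_score_entries(side, side)
compressed = attention_score_entries(side, side, reduction=8)

print(side)                # token map is side x side
print(full, compressed)    # attention entries without / with compression
print(full // compressed)  # saving factor, equal to reduction ** 2 here
```

Under these assumptions the query count is unchanged, so the output resolution is preserved, while the attention map shrinks by the square of the reduction factor.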
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP 40.3 | 2454 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy 81.6 | 1866 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy 83.6 | 1453 |
| Image Classification | ImageNet (val) | Top-1 Acc 81.6 | 1206 |
| Instance Segmentation | COCO 2017 (val) | APm 0.372 | 1144 |
| Object Detection | MS-COCO 2017 (val) | -- | 237 |
| Image Classification | ImageNet-1K | Top-1 Accuracy 81.6 | 78 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy 77.2 | 48 |
| Image Classification | ImageNet-1k 1.0 (test val) | Top-1 Acc 79.6 | 24 |