Refiner: Refining Self-attention for Vision Transformers
About
Vision Transformers (ViTs) have shown competitive accuracy on image classification tasks compared with CNNs, yet they generally require much more data for model pre-training. Most recent works are thus dedicated to designing more complex architectures or training methods to address the data-efficiency issue of ViTs. However, few of them explore improving the self-attention mechanism, a key factor distinguishing ViTs from CNNs. Different from existing works, we introduce a conceptually simple scheme, called refiner, to directly refine the self-attention maps of ViTs. Specifically, refiner explores attention expansion, which projects the multi-head attention maps to a higher-dimensional space to promote their diversity. Further, refiner applies convolutions to augment local patterns of the attention maps, which we show is equivalent to a distributed local attention: features are first aggregated locally with learnable kernels and then globally aggregated with self-attention. Extensive experiments demonstrate that refiner works surprisingly well. Significantly, it enables ViTs to achieve 86% top-1 classification accuracy on ImageNet with only 81M parameters.
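The following is a minimal PyTorch sketch of the refinement idea described above, not the authors' implementation: the multi-head attention maps are expanded to a larger number of heads with a learned projection, locally augmented with a per-head convolution, and reduced back before being applied to the values. All class and argument names (`RefinedSelfAttention`, `expanded_heads`, `kernel_size`) are illustrative assumptions, as are details such as where the softmax is placed.

```python
import torch
import torch.nn as nn

class RefinedSelfAttention(nn.Module):
    """Sketch of refiner-style self-attention (names and details are assumptions)."""

    def __init__(self, dim, num_heads=8, expanded_heads=16, kernel_size=3):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3, bias=False)
        # Attention expansion: 1x1 conv acts as a linear projection across heads.
        self.expand = nn.Conv2d(num_heads, expanded_heads, kernel_size=1)
        # Local augmentation: depthwise conv over each expanded attention map.
        self.refine = nn.Conv2d(expanded_heads, expanded_heads, kernel_size,
                                padding=kernel_size // 2, groups=expanded_heads)
        # Reduce back to the original number of heads.
        self.reduce = nn.Conv2d(expanded_heads, num_heads, kernel_size=1)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)           # each: (B, heads, N, head_dim)
        attn = (q @ k.transpose(-2, -1)) * self.scale  # (B, heads, N, N)
        attn = attn.softmax(dim=-1)
        # Refine the attention maps: expand -> convolve locally -> reduce.
        attn = self.reduce(self.refine(self.expand(attn)))
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)

# Example usage on dummy token embeddings (batch of 2, 197 tokens, dim 384).
x = torch.randn(2, 197, 384)
print(RefinedSelfAttention(dim=384).forward(x).shape)  # torch.Size([2, 197, 384])
```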
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 85.9 | 706 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 81.2 | 359 |
| Medical Image Segmentation | ISIC 2018 | Dice Score | 89.2 | 92 |
| Image Classification | ImageNet Real 1k (val) | Top-1 Accuracy | 90.1 | 64 |
| Medical Image Segmentation | ACDC | DSC (Avg) | 91.94 | 48 |
| Machine Translation | WMT En-Ro 2016 (test) | BLEU | 34.25 | 39 |
| Medical Image Segmentation | ICH | DSC | 83.14 | 25 |