On the Surprising Effectiveness of Attention Transfer for Vision Transformers
About
Conventional wisdom suggests that pre-training Vision Transformers (ViT) improves downstream performance by learning useful representations. Is this actually true? We investigate this question and find that the features and representations learned during pre-training are not essential. Surprisingly, using only the attention patterns from pre-training (i.e., guiding how information flows between tokens) is sufficient for models to learn high-quality features from scratch and achieve comparable downstream performance. We show this by introducing a simple method called attention transfer, in which only the attention patterns from a pre-trained teacher ViT are transferred to a student, either by copying or by distilling the attention maps. Since attention transfer lets the student learn its own features, ensembling it with a fine-tuned teacher further improves accuracy on ImageNet. We systematically study various aspects of our findings on the sufficiency of attention maps, including distribution-shift settings where they underperform fine-tuning. We hope our exploration provides a better understanding of what pre-training accomplishes and leads to a useful alternative to the standard practice of fine-tuning.
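To make the "distilling the attention maps" variant concrete, here is a minimal NumPy sketch of the core idea: compute per-head attention maps for a teacher and a student, then penalize the student with a cross-entropy loss between the two row distributions. The function names (`attention_map`, `attn_distill_loss`) and the choice of cross-entropy as the matching loss are illustrative assumptions, not the authors' actual implementation.

```python
import numpy as np

def attention_map(q, k):
    # q, k: (tokens, dim) query/key projections for one attention head.
    # Returns the (tokens, tokens) softmax attention map; rows sum to 1.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def attn_distill_loss(teacher_attn, student_attn, eps=1e-9):
    # Cross-entropy between each teacher attention row and the matching
    # student row, averaged over tokens. This is one plausible matching
    # loss; the paper may use a different objective.
    return -np.mean(
        np.sum(teacher_attn * np.log(student_attn + eps), axis=-1)
    )

# Toy usage: random projections standing in for teacher/student heads.
rng = np.random.default_rng(0)
teacher = attention_map(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
student = attention_map(rng.normal(size=(4, 8)), rng.normal(size=(4, 8)))
loss = attn_distill_loss(teacher, student)
```

In the "copying" variant, no loss is needed at all: the teacher's maps are used directly to mix the student's value vectors, so the student only learns its own feature projections.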
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2454 |
| Image Classification | ImageNet-1K (val) | Top-1 Accuracy: 86.3 | 1866 |
| Instance Segmentation | COCO 2017 (val) | -- | 1144 |
| Image Classification | ImageNet-A | Top-1 Accuracy: 54.3 | 553 |
| Image Classification | ImageNet-V2 | -- | 487 |
| Image Classification | ImageNet-R | Accuracy: 57.5 | 148 |
| Image Classification | ImageNet-S | Top-1 Accuracy: 43.1 | 43 |
| Long-tailed Recognition | iNaturalist 2017 (test) | Accuracy: 69.3 | 16 |
| Long-tailed Recognition | iNaturalist 2018 | -- | 7 |
| Long-tailed Recognition | iNaturalist 2019 | Accuracy: 80 | 4 |