Training data-efficient image transformers & distillation through attention
About
Recently, neural networks based purely on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained on hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on ImageNet only. We train it on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves a top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token that ensures the student learns from the teacher through attention. We show that this token-based distillation is particularly effective when a convnet is used as the teacher. This leads us to report results competitive with convnets both on ImageNet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
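The hard-distillation objective behind this teacher-student strategy can be sketched in a few lines: the output of the class token is trained against the ground-truth labels, while the output of the distillation token is trained against the teacher's hard (argmax) predictions, and the two cross-entropy terms are averaged. The NumPy sketch below illustrates only this loss combination, not the transformer architecture itself; the function and variable names are illustrative, not taken from the released code.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # mean negative log-likelihood of integer class targets
    p = softmax(logits)
    n = logits.shape[0]
    return -np.log(p[np.arange(n), targets] + 1e-12).mean()

def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    """Hard-label distillation: the class token's head is supervised by the
    true labels, the distillation token's head by the teacher's argmax
    predictions; the two terms are weighted equally."""
    teacher_labels = teacher_logits.argmax(axis=-1)
    return 0.5 * cross_entropy(cls_logits, labels) + \
           0.5 * cross_entropy(dist_logits, teacher_labels)
```

When the teacher and the labels disagree on an image, the two terms pull the student toward different targets, which is one way the convnet teacher's inductive bias reaches the transformer student.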
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | CIFAR-100 (test) | Accuracy | 90.8 | 3518 |
| Image Classification | CIFAR-10 (test) | Accuracy | 97.91 | 3381 |
| Semantic Segmentation | ADE20K (val) | mIoU | 47.4 | 2731 |
| Object Detection | COCO 2017 (val) | AP | 36.9 | 2454 |
| Semantic Segmentation | PASCAL VOC 2012 (val) | Mean IoU | 53 | 2040 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 83.1 | 1866 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 85.2 | 1453 |
| Person Re-Identification | Market1501 (test) | Rank-1 Accuracy | 94.4 | 1264 |
| Image Classification | ImageNet (val) | Top-1 Accuracy | 83.1 | 1206 |
| Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy (%) | 84.2 | 1155 |