
Training data-efficient image transformers & distillation through attention

About

Recently, neural networks purely based on attention were shown to address image understanding tasks such as image classification. However, these visual transformers are pre-trained with hundreds of millions of images using an expensive infrastructure, thereby limiting their adoption. In this work, we produce a competitive convolution-free transformer by training on Imagenet only. We train them on a single computer in less than 3 days. Our reference vision transformer (86M parameters) achieves top-1 accuracy of 83.1% (single-crop evaluation) on ImageNet with no external data. More importantly, we introduce a teacher-student strategy specific to transformers. It relies on a distillation token ensuring that the student learns from the teacher through attention. We show the interest of this token-based distillation, especially when using a convnet as a teacher. This leads us to report results competitive with convnets for both Imagenet (where we obtain up to 85.2% accuracy) and when transferring to other tasks. We share our code and models.
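The teacher-student strategy above pairs a class token (supervised by the ground-truth label) with a distillation token (supervised by the teacher's prediction). As a minimal NumPy sketch of the hard-label variant of that objective — function names are illustrative, and the equal 0.5 weighting is the paper's hard-distillation setting:

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, labels):
    # mean negative log-likelihood of the target class
    p = softmax(logits)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def hard_distillation_loss(cls_logits, dist_logits, labels, teacher_logits):
    # Hard-label distillation: the class-token head is trained against the
    # ground-truth label, the distillation-token head against the teacher's
    # argmax prediction, with equal weight on both terms.
    teacher_labels = teacher_logits.argmax(axis=-1)
    return (0.5 * cross_entropy(cls_logits, labels)
            + 0.5 * cross_entropy(dist_logits, teacher_labels))
```

At inference time the paper fuses the two heads (e.g. by averaging their softmax outputs); the sketch above only covers the training objective, and the soft variant would replace the second term with a temperature-scaled KL divergence.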

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, Hervé Jégou • 2020

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Classification | CIFAR-100 (test) | Accuracy | 90.8 | 3518
Image Classification | CIFAR-10 (test) | Accuracy | 97.91 | 3381
Semantic Segmentation | ADE20K (val) | mIoU | 47.4 | 2731
Object Detection | COCO 2017 (val) | AP | 36.9 | 2454
Semantic Segmentation | PASCAL VOC 2012 (val) | Mean IoU | 53 | 2040
Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 83.1 | 1866
Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 85.2 | 1453
Person Re-Identification | Market1501 (test) | Rank-1 Accuracy | 94.4 | 1264
Image Classification | ImageNet (val) | Top-1 Accuracy | 83.1 | 1206
Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy (%) | 84.2 | 1155

(Showing 10 of 315 rows)

Other info

Code
