
TransMix: Attend to Mix for Vision Transformers

About

Mixup-based augmentation has been found to help models generalize during training, especially Vision Transformers (ViTs), since they can easily overfit. However, previous mixup-based methods rest on the prior assumption that the interpolation ratio of the targets should match the ratio used to interpolate the inputs. This can produce a strange situation in which, because of the randomness in augmentation, the mixed image contains no valid object yet there is still a response in the label space. To bridge this gap between the input and label spaces, we propose TransMix, which mixes labels based on the attention maps of Vision Transformers. The confidence of a label will be larger if the corresponding input image is weighted higher by the attention map. TransMix is embarrassingly simple and can be implemented in just a few lines of code without introducing any extra parameters or FLOPs to ViT-based models. Experimental results show that our method consistently improves various ViT-based models at different scales on ImageNet classification. After being pre-trained with TransMix on ImageNet, the ViT-based models also demonstrate better transferability to semantic segmentation, object detection and instance segmentation. TransMix is also more robust when evaluated on 4 different benchmarks. Code will be made publicly available at https://github.com/Beckschen/TransMix.
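The label-mixing idea in the abstract can be illustrated with a small sketch. This is not the authors' code: it assumes that each patch of the mixed image has a single attention weight (for example, class-token attention averaged over heads) and that a binary patch-level mask records which source image each patch came from. The function names `attention_mix_lambda` and `mix_labels` are hypothetical, and plain Python lists stand in for tensors.

```python
def attention_mix_lambda(attn, mask):
    """Share of total attention that falls on patches taken from image A.

    attn: non-negative attention weight per patch (assumed input).
    mask: 1 if the patch comes from image A, 0 if it comes from image B.
    """
    if len(attn) != len(mask):
        raise ValueError("attn and mask must have the same length")
    total = sum(attn)
    if total <= 0:
        raise ValueError("attention weights must sum to a positive value")
    return sum(a * m for a, m in zip(attn, mask)) / total


def mix_labels(label_a, label_b, lam):
    """Linearly interpolate two label vectors with weight lam on label_a."""
    return [lam * a + (1 - lam) * b for a, b in zip(label_a, label_b)]


# A 2x2 patch grid, flattened. Only the first patch comes from image A.
attn = [0.5, 0.25, 0.125, 0.125]
mask = [1, 0, 0, 0]

# Area-based ratio, as in earlier mixup-style methods: 1 of 4 patches.
lam_area = sum(mask) / len(mask)

# Attention-based ratio: the single patch from A receives half the attention.
lam_attn = attention_mix_lambda(attn, mask)

# One-hot labels for the two source images, mixed with the attention ratio.
mixed = mix_labels([1.0, 0.0], [0.0, 1.0], lam_attn)
```

In this toy case the patch from image A covers a quarter of the area but draws half of the attention, so the attention-based weight on A's label is larger than the area-based one.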

Jie-Neng Chen, Shuyang Sun, Ju He, Philip Torr, Alan Yuille, Song Bai • 2021

Related benchmarks

Task                               Dataset                 Result                 Rank
Object Detection                   COCO 2017 (val)         --                     2454
Image Classification               ImageNet-1k (val)       Top-1 Accuracy 81.8    1453
Instance Segmentation              COCO 2017 (val)         --                     1144
Image Classification               ImageNet-1k (val)       Top-1 Accuracy 82.4    840
Image Classification               ImageNet-1k (val)       Top-1 Acc 83.8         706
Image Classification               CIFAR-100 (val)         Accuracy 83.7          661
Image Classification               CIFAR-100               Top-1 Accuracy 80.66   622
Fine-grained Image Classification  Stanford Cars (test)    Accuracy 91.66         348
Semantic segmentation              PASCAL Context (val)    mIoU 50.6              323
Image Classification               Stanford Cars (test)    --                     306
Showing 10 of 21 rows
