Semi-supervised Vision Transformers at Scale
About
We study semi-supervised learning (SSL) for vision transformers (ViT), an under-explored topic despite the wide adoption of the ViT architectures to different tasks. To tackle this problem, we propose a new SSL pipeline, consisting of first un/self-supervised pre-training, followed by supervised fine-tuning, and finally semi-supervised fine-tuning. At the semi-supervised fine-tuning stage, we adopt an exponential moving average (EMA)-Teacher framework instead of the popular FixMatch, since the former is more stable and delivers higher accuracy for semi-supervised vision transformers. In addition, we propose a probabilistic pseudo mixup mechanism to interpolate unlabeled samples and their pseudo labels for improved regularization, which is important for training ViTs with weak inductive bias. Our proposed method, dubbed Semi-ViT, achieves comparable or better performance than the CNN counterparts in the semi-supervised classification setting. Semi-ViT also enjoys the scalability benefits of ViTs that can be readily scaled up to large-size models with increasing accuracies. For example, Semi-ViT-Huge achieves an impressive 80% top-1 accuracy on ImageNet using only 1% labels, which is comparable with Inception-v4 using 100% ImageNet labels.
Related benchmarks
| Task | Dataset | Result | Rank | |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU49.9 | 2731 | |
| Object Detection | COCO 2017 (val) | -- | 2454 | |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy83.1 | 1866 | |
| Instance Segmentation | COCO 2017 (val) | -- | 1144 | |
| Image Classification | Stanford Cars | Accuracy92.1 | 477 | |
| Image Classification | ImageNet 1% labeled | -- | 118 | |
| Image Classification | ImageNet (10% labels) | Top-1 Acc84.3 | 98 | |
| Image Classification | ImageNet 1k (10% labels) | Top-1 Acc84.3 | 92 | |
| Image Classification | Food-101 (test) | -- | 89 | |
| Image Classification | ImageNet 1k (1%) | Top-1 Acc80 | 49 |