Masked Autoencoders Are Scalable Vision Learners

About

This paper shows that masked autoencoders (MAE) are scalable self-supervised learners for computer vision. Our MAE approach is simple: we mask random patches of the input image and reconstruct the missing pixels. It is based on two core designs. First, we develop an asymmetric encoder-decoder architecture, with an encoder that operates only on the visible subset of patches (without mask tokens), along with a lightweight decoder that reconstructs the original image from the latent representation and mask tokens. Second, we find that masking a high proportion of the input image, e.g., 75%, yields a nontrivial and meaningful self-supervisory task. Coupling these two designs enables us to train large models efficiently and effectively: we accelerate training (by 3x or more) and improve accuracy. Our scalable approach allows for learning high-capacity models that generalize well: e.g., a vanilla ViT-Huge model achieves the best accuracy (87.8%) among methods that use only ImageNet-1K data. Transfer performance in downstream tasks outperforms supervised pre-training and shows promising scaling behavior.
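The random masking step described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `patchify` and `random_masking`, the 16-pixel patch size, and the 224x224 input are assumptions chosen for illustration. Only the 75% masking ratio and the idea that the encoder sees just the visible patches come from the abstract.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into non-overlapping flattened patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch_size, patch_size, c)
    return x.reshape(gh * gw, patch_size * patch_size * c)

def random_masking(patches, mask_ratio=0.75, rng=None):
    """Randomly keep a subset of patches (hypothetical helper).

    Returns the visible patches, a boolean mask (True = masked),
    and the sorted indices of the kept patches.
    """
    rng = np.random.default_rng() if rng is None else rng
    num_patches = patches.shape[0]
    num_keep = int(num_patches * (1 - mask_ratio))
    keep_idx = np.sort(rng.permutation(num_patches)[:num_keep])
    mask = np.ones(num_patches, dtype=bool)
    mask[keep_idx] = False
    return patches[keep_idx], mask, keep_idx

# Usage: an assumed 224x224 RGB image split into 16x16 patches.
image = np.zeros((224, 224, 3), dtype=np.float32)
patches = patchify(image, patch_size=16)
visible, mask, keep_idx = random_masking(patches, mask_ratio=0.75)
print(patches.shape, visible.shape, int(mask.sum()))
```

Under these assumptions the encoder would receive 49 of the 196 patches, while the lightweight decoder would receive the encoded visible patches plus mask tokens for the remaining 147 positions.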

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, Ross Girshick • 2021

Related benchmarks

Task                       Dataset                   Result                   Rank
Semantic segmentation      ADE20K (val)              mIoU 53.6                2731
Object Detection           COCO 2017 (val)           AP 52.4                  2454
Semantic segmentation      PASCAL VOC 2012 (val)     Mean IoU 9.8             2040
Image Classification       ImageNet-1K 1.0 (val)     Top-1 Accuracy 87.8      1866
Image Classification       ImageNet-1k (val)         Top-1 Accuracy 83.6      1453
Semantic segmentation      PASCAL VOC 2012 (test)    mIoU 75                  1342
Image Classification       ImageNet (val)            Top-1 Acc 85.9           1206
Visual Question Answering  VQA v2                    Accuracy 63.5            1165
Classification             ImageNet-1K 1.0 (val)     Top-1 Accuracy (%) 85.9  1155
Semantic segmentation      Cityscapes (test)         mIoU 64.7                1145
Showing 10 of 628 rows