Masked Siamese Networks for Label-Efficient Learning
About
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark. Our code is publicly available.
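The scalability claim above rests on dropping masked patch tokens before they ever reach the encoder, so compute scales with the number of *kept* patches rather than the full grid. A minimal sketch of that random patch masking step is below; the function name, `keep_ratio` parameter, and tensor shapes are illustrative assumptions, not the repository's actual API.

```python
import torch

def random_patch_mask(patches, keep_ratio=0.3):
    """Keep a random subset of patch tokens per image (hypothetical helper).

    The dropped tokens are removed entirely, so a Vision Transformer
    downstream only processes the unmasked patches.
    """
    B, N, D = patches.shape
    num_keep = max(1, int(N * keep_ratio))
    # Random permutation of patch indices per sample; keep the first num_keep.
    idx = torch.rand(B, N).argsort(dim=1)[:, :num_keep]
    return torch.gather(patches, 1, idx.unsqueeze(-1).expand(-1, -1, D))

# Example: a batch of 2 images, each tokenized into 196 patches of dim 768
# (as in a ViT-B/16 on 224x224 inputs).
tokens = torch.randn(2, 196, 768)
masked = random_patch_mask(tokens, keep_ratio=0.3)
```

With `keep_ratio=0.3`, the encoder sees 58 of 196 tokens per image, which is where the training-cost savings come from; the anchor (masked) and target (unmasked) views are then matched in representation space.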
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | -- | -- | 1453 |
| Video Object Segmentation | DAVIS 2017 (val) | J mean | 57.6 | 1130 |
| Semantic Segmentation | ADE20K | mIoU | 26.66 | 936 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 62.8 | 840 |
| Semantic Segmentation | Cityscapes | mIoU | 25.39 | 578 |
| Image Classification | Food-101 | Accuracy | 68.93 | 494 |
| Semantic Segmentation | Pascal VOC | mIoU | 0.6859 | 172 |
| Image Classification | Oxford-IIIT Pet | Accuracy | 75.91 | 161 |
| Image Classification | iNaturalist 18 | Overall Accuracy | 72.1 | 125 |
| Image Retrieval | Revisited Oxford (ROxf) (Medium) | mAP | 36.6 | 124 |