SLIP: Self-supervision meets Language-Image Pre-training
About
Recent work has shown that self-supervised pre-training leads to improvements over supervised learning on challenging visual recognition tasks. CLIP, an exciting new approach to learning with language supervision, demonstrates promising performance on a wide variety of benchmarks. In this work, we explore whether self-supervised learning can aid in the use of language supervision for visual representation learning. We introduce SLIP, a multi-task learning framework for combining self-supervised learning and CLIP pre-training. After pre-training with Vision Transformers, we thoroughly evaluate representation quality and compare performance to both CLIP and self-supervised learning under three distinct settings: zero-shot transfer, linear classification, and end-to-end finetuning. Across ImageNet and a battery of additional datasets, we find that SLIP improves accuracy by a large margin. We validate our results further with experiments on different model sizes, training schedules, and pre-training datasets. Our findings show that SLIP enjoys the best of both worlds: better performance than self-supervision (+8.1% linear accuracy) and language supervision (+5.2% zero-shot accuracy).
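The multi-task idea in the abstract can be illustrated with a small sketch. It assumes the two objectives are combined as a weighted sum: a contrastive image-text term plus a contrastive term between two augmented views of the same image. The abstract does not specify how the objectives are combined. The function names, the `ssl_scale` weight, and the temperature value are all hypothetical, and the two-view term is a simplified cross-view contrastive loss rather than the exact objective used by the authors.

```python
import math


def l2_normalize(v):
    # Scale a vector to unit length; a zero vector is returned unchanged.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(queries, keys, temperature=0.1):
    # Mean cross-entropy in which queries[i] should match keys[i]
    # and every other key in the batch acts as a negative.
    q = [l2_normalize(v) for v in queries]
    k = [l2_normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        m = max(logits)  # subtract the max for numerical stability
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - logits[i]
    return total / len(q)


def symmetric_contrastive(a, b, temperature=0.1):
    # Average the loss over both matching directions (a -> b and b -> a).
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


def combined_loss(image_emb, text_emb, view1_emb, view2_emb, ssl_scale=1.0):
    # Assumption: total loss = language-supervised term + ssl_scale * self-supervised term.
    language_loss = symmetric_contrastive(image_emb, text_emb)
    ssl_loss = symmetric_contrastive(view1_emb, view2_emb)
    return language_loss + ssl_scale * ssl_loss
```

Each argument is a list of embedding vectors, one per example in the batch, with matching indices denoting positive pairs. Setting `ssl_scale=0.0` reduces the sketch to the language-supervised term alone, which makes it easy to compare the two regimes on the same batch.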
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K (val) | mIoU | 48.5 | 2731 |
| Instance Segmentation | COCO 2017 (val) | APm | 0.403 | 1144 |
| Semantic Segmentation | ADE20K | mIoU | 45.7 | 936 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 34.3 | 840 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 82.6 | 512 |
| Image Classification | CIFAR-10 | Accuracy | 79.2 | 507 |
| Image Classification | EuroSAT | Accuracy | 84.9 | 497 |
| Image Classification | Food-101 | Accuracy | 87.6 | 494 |
| Image Classification | DTD | Accuracy | 14.4 | 487 |
| Image Classification | Stanford Cars | Accuracy | 85.6 | 477 |