Revisiting Weakly Supervised Pre-Training of Visual Perception Models
About
Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de facto standard, recent studies suggest that large-scale weakly supervised pre-training can outperform fully supervised approaches. This paper revisits weakly supervised pre-training of models using hashtag supervision with modern versions of residual networks and the largest-ever dataset of images and corresponding hashtags. We study the performance of the resulting models in various transfer-learning settings, including zero-shot transfer. We also compare our models with those obtained via large-scale self-supervised learning. We find our weakly supervised models to be very competitive across all settings, and they substantially outperform their self-supervised counterparts. We also investigate whether our models learned potentially troubling associations or stereotypes. Overall, our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems. Our models, Supervised Weakly through hashtAGs (SWAG), are publicly available.
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | -- | 2454 |
| Image Classification | ImageNet-1k (val) | -- | -- | 1453 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 85.3 | 798 |
| Image Classification | ImageNet-1k (val) | Top-1 Acc | 88.6 | 706 |
| Image Classification | ImageNet A | Top-1 Acc | 92.6 | 553 |
| Object Detection | LVIS v1.0 (val) | AP (bbox) | 47.1 | 518 |
| Image Classification | Food-101 | Accuracy | 96.9 | 494 |
| Image Classification | ImageNet V2 | Top-1 Acc | 81.1 | 487 |
| Image Classification | Stanford Cars | Accuracy | 94.7 | 477 |
| Image Classification | ImageNet-Sketch | Top-1 Accuracy | 83.7 | 360 |