Revisiting Weakly Supervised Pre-Training of Visual Perception Models
About
Model pre-training is a cornerstone of modern visual recognition systems. Although fully supervised pre-training on datasets like ImageNet is still the de facto standard, recent studies suggest that large-scale weakly supervised pre-training can outperform fully supervised approaches. This paper revisits weakly supervised pre-training of models using hashtag supervision with modern versions of residual networks and the largest-ever dataset of images and corresponding hashtags. We study the performance of the resulting models in various transfer-learning settings, including zero-shot transfer. We also compare our models with those obtained via large-scale self-supervised learning. We find our weakly supervised models to be very competitive across all settings, and find that they substantially outperform their self-supervised counterparts. We also investigate whether our models learned potentially troubling associations or stereotypes. Overall, our results provide a compelling argument for the use of weakly supervised learning in the development of visual recognition systems. Our models, Supervised Weakly through hashtAGs (SWAG), are available publicly.
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Object Detection | COCO 2017 (val) | -- | 2643 |
| Image Classification | ImageNet-1k (val) | -- | 1469 |
| Image Classification | ImageNet-1k (test) | Top-1 Accuracy: 85.3 | 848 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy: 88.6 | 706 |
| Image Classification | ImageNet-A | Top-1 Accuracy: 92.6 | 654 |
| Image Classification | Stanford Cars | Accuracy: 94.7 | 635 |
| Image Classification | ImageNet-V2 | Top-1 Accuracy: 81.1 | 611 |
| Image Classification | Food-101 | Accuracy: 96.9 | 542 |
| Object Detection | LVIS v1.0 (val) | AP (bbox): 47.1 | 529 |
| Image Classification | SUN397 | Accuracy: 96.9 | 441 |