Exploring Localization for Self-supervised Fine-grained Contrastive Learning
About
Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success in various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios has not been fully explored. We point out that current contrastive methods are prone to memorizing background/foreground texture and are therefore limited in their ability to localize the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for fine-grained self-supervised pre-training. Based on these findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps the saliency regions of images as a novel form of view generation and then guides the model to localize foreground objects via a cross-view alignment loss. Extensive experiments on both small- and large-scale fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
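The crop-and-swap view generation mentioned in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the abstract does not say how saliency maps are obtained, how the salient region is delimited, or where it is pasted. The sketch below assumes a precomputed saliency map in [0, 1], a fixed threshold, a tight bounding box, and a uniformly random paste location in a second image of the same size. The names `saliency_bbox` and `crop_and_swap` are hypothetical.

```python
import numpy as np


def saliency_bbox(saliency, threshold=0.5):
    """Tight box (top, left, bottom, right) around pixels above the threshold.

    Assumption: `saliency` is a 2-D array in [0, 1] from some external
    saliency detector. Falls back to the full image if nothing is salient.
    """
    mask = saliency > threshold
    if not mask.any():
        h, w = saliency.shape
        return 0, 0, h, w
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return int(rows[0]), int(cols[0]), int(rows[-1]) + 1, int(cols[-1]) + 1


def crop_and_swap(foreground_img, saliency, background_img, rng):
    """Paste the salient crop of one image onto another image.

    Assumption: both images are H x W x C arrays of the same size, so the
    crop always fits. Returns the new view and the box where the crop landed,
    which an alignment loss could use as a localization target.
    """
    top, left, bottom, right = saliency_bbox(saliency)
    patch = foreground_img[top:bottom, left:right]
    ph, pw = patch.shape[:2]
    h, w = background_img.shape[:2]

    # Pick a random top-left corner such that the patch stays inside the image.
    y = int(rng.integers(0, h - ph + 1))
    x = int(rng.integers(0, w - pw + 1))

    view = background_img.copy()
    view[y:y + ph, x:x + pw] = patch
    return view, (y, x, y + ph, x + pw)


# Toy usage: a 3x3 salient square of ones pasted onto an all-zero background.
rng = np.random.default_rng(0)
fg = np.ones((8, 8, 3), dtype=np.float32)
bg = np.zeros((8, 8, 3), dtype=np.float32)
sal = np.zeros((8, 8), dtype=np.float32)
sal[2:5, 3:6] = 1.0
view, box = crop_and_swap(fg, sal, bg, rng)
```

In this sketch the returned box plays the role of a known foreground location in the generated view; how CVSA actually uses such information in its cross-view alignment loss is described in the paper, not here.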
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | CUB | Accuracy 77.1 | 249 |
| Fine-grained visual classification | NABirds (test) | Top-1 Accuracy 77.57 | 157 |
| Image Classification | iNaturalist 2018 (val) | Top-1 Accuracy 47.5 | 116 |
| Fine-grained Image Classification | CUB-200 (test) | Accuracy 69.14 | 45 |
| Image Classification | NABirds | Accuracy 79.64 | 37 |
| Image Classification | Aircrafts | Top-1 Accuracy 87.27 | 27 |
| Fine-grained Image Classification | CUB | Top-1 Acc 66.88 | 22 |
| Fine-grained Image Classification | NABirds | -- | 22 |
| Fine-grained Image Classification | Cars | Top-1 Acc 77.45 | 20 |
| Fine grained classification | Cars (test) | Accuracy 87.13 | 13 |