
Exploring Localization for Self-supervised Fine-grained Contrastive Learning

About

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success in various downstream tasks such as image classification and object detection, self-supervised pre-training for fine-grained scenarios has not been fully explored. We point out that current contrastive methods are prone to memorizing background/foreground texture and therefore have difficulty localizing the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for fine-grained self-supervised pre-training. Based on these findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps the saliency regions of images as a novel view-generation step and then guides the model to localize foreground objects via a cross-view alignment loss. Extensive experiments on both small- and large-scale fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
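The crop-and-swap view generation described above can be sketched as follows. This is a hypothetical illustration, not the paper's implementation: it assumes the saliency region of each image is already available as a bounding box (in the actual pipeline a saliency detector would supply it), and it swaps the two salient crops, resizing each crop to fit the other image's salient region.

```python
import numpy as np

def resize_nn(patch, out_h, out_w):
    """Nearest-neighbour resize of an HxWxC array (kept dependency-free)."""
    h, w = patch.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return patch[rows][:, cols]

def crop_and_swap(img_a, img_b, box_a, box_b):
    """Swap the salient regions of two images (sketch of CVSA-style
    view generation). Boxes are (y, x, h, w) in pixel coordinates;
    in practice they would come from a saliency detector.
    """
    ya, xa, ha, wa = box_a
    yb, xb, hb, wb = box_b
    crop_a = img_a[ya:ya + ha, xa:xa + wa].copy()
    crop_b = img_b[yb:yb + hb, xb:xb + wb].copy()
    view_a = img_a.copy()
    view_b = img_b.copy()
    # Paste the other image's salient crop, resized to fill the hole,
    # so each new view pairs one foreground with the other background.
    view_a[ya:ya + ha, xa:xa + wa] = resize_nn(crop_b, ha, wa)
    view_b[yb:yb + hb, xb:xb + wb] = resize_nn(crop_a, hb, wb)
    return view_a, view_b
```

The swapped views, together with the known salient boxes, are what a cross-view alignment loss could then use to encourage the model to attend to the foreground region regardless of the surrounding background.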

Di Wu, Siyuan Li, Zelin Zang, Stan Z. Li • 2021

Related benchmarks

Task                                | Dataset                 | Metric         | Result | Rank
------------------------------------|-------------------------|----------------|--------|-----
Image Classification                | CUB                     | Accuracy       | 77.1   | 249
Fine-grained Visual Classification  | NABirds (test)          | Top-1 Accuracy | 77.57  | 157
Image Classification                | iNaturalist 2018 (val)  | Top-1 Accuracy | 47.5   | 116
Fine-grained Image Classification   | CUB-200 (test)          | Accuracy       | 69.14  | 45
Image Classification                | NABirds                 | Accuracy       | 79.64  | 37
Image Classification                | Aircrafts               | Top-1 Accuracy | 87.27  | 27
Fine-grained Image Classification   | CUB                     | Top-1 Accuracy | 66.88  | 22
Fine-grained Image Classification   | NABirds                 | --             | --     | 22
Fine-grained Image Classification   | Cars                    | Top-1 Accuracy | 77.45  | 20
Fine-grained Classification         | Cars (test)             | Accuracy       | 87.13  | 13

(Showing 10 of 12 rows.)
