Exploring Localization for Self-supervised Fine-grained Contrastive Learning

About

Self-supervised contrastive learning has demonstrated great potential in learning visual representations. Despite its success in various downstream tasks such as image classification and object detection, self-supervised pre-training has not been fully explored for fine-grained scenarios. We point out that current contrastive methods are prone to memorizing background/foreground texture and are therefore limited in their ability to localize the foreground object. Our analysis suggests that learning to extract discriminative texture information and learning to localize are equally crucial for fine-grained self-supervised pre-training. Based on our findings, we introduce cross-view saliency alignment (CVSA), a contrastive learning framework that first crops and swaps saliency regions of images as a novel form of view generation and then guides the model to localize foreground objects via a cross-view alignment loss. Extensive experiments on both small- and large-scale fine-grained classification benchmarks show that CVSA significantly improves the learned representation.
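
The crop-and-swap view generation mentioned in the abstract can be illustrated roughly as follows. This is a minimal sketch and not the authors' implementation: it assumes a saliency map is already available for each image, takes the salient region to be the bounding box of pixels above a threshold, and swaps those boxes between two images with a nearest-neighbour resize. The function names, the `threshold` parameter, and the bounding-box heuristic are all assumptions made for illustration. The cross-view alignment loss is not sketched, since the abstract does not specify its form.

```python
import numpy as np


def saliency_bbox(saliency, threshold=0.5):
    """Bounding box (top, left, bottom, right) of pixels with saliency > threshold.

    Hypothetical helper: the abstract does not say how salient regions are chosen.
    """
    ys, xs = np.nonzero(saliency > threshold)
    if ys.size == 0:
        # No salient pixels found: fall back to the whole image.
        return 0, 0, saliency.shape[0], saliency.shape[1]
    return int(ys.min()), int(xs.min()), int(ys.max()) + 1, int(xs.max()) + 1


def resize_nearest(patch, out_h, out_w):
    """Nearest-neighbour resize of an (H, W) or (H, W, C) array."""
    rows = np.arange(out_h) * patch.shape[0] // out_h
    cols = np.arange(out_w) * patch.shape[1] // out_w
    return patch[rows][:, cols]


def swap_salient_regions(img_a, sal_a, img_b, sal_b, threshold=0.5):
    """Return two new views in which the salient boxes of A and B are exchanged.

    Each image keeps its own background; the foreground crop of the other
    image is resized to fit the box where its own foreground used to be.
    """
    ta, la, ba, ra = saliency_bbox(sal_a, threshold)
    tb, lb, bb, rb = saliency_bbox(sal_b, threshold)
    crop_a = img_a[ta:ba, la:ra]
    crop_b = img_b[tb:bb, lb:rb]
    view_a, view_b = img_a.copy(), img_b.copy()
    view_a[ta:ba, la:ra] = resize_nearest(crop_b, ba - ta, ra - la)
    view_b[tb:bb, lb:rb] = resize_nearest(crop_a, bb - tb, rb - lb)
    return view_a, view_b
```

In a contrastive pipeline, the swapped views would presumably be fed to the encoder alongside the usual augmented views; how CVSA combines them is described in the paper, not here.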

Di Wu, Siyuan Li, Zelin Zang, Stan Z. Li • 2021

Related benchmarks

Task	Dataset	Result
Image Classification	CUB	Accuracy 77.1	331
Fine-grained visual classification	NABirds (test)	Top-1 Accuracy 77.57	157
Image Classification	iNaturalist 2018 (val)	Top-1 Accuracy 47.5	116
Image Classification	NABirds	Accuracy 79.64	63
Fine-grained Image Classification	CUB	Top-1 Acc 66.88	45
Fine-grained Image Classification	CUB-200 (test)	Accuracy 69.14	45
Image Classification	Aircrafts	Top-1 Accuracy 87.27	27
Fine-grained Image Classification	NABirds	--	22
Fine-grained Image Classification	Cars	Top-1 Acc 77.45	20
Fine grained classification	Cars (test)	Accuracy 87.13	13

Showing 10 of 12 rows
