Unleashing the Power of Contrastive Self-Supervised Visual Models via Contrast-Regularized Fine-Tuning
About
Contrastive self-supervised learning (CSL) has attracted increasing attention for model pre-training via unlabeled data. The resulting CSL models provide instance-discriminative visual features that are uniformly scattered in the feature space. During deployment, the common practice is to directly fine-tune CSL models with cross-entropy, which, however, may not be the best strategy in practice. Although cross-entropy tends to separate inter-class features, the resulting models still have limited capability for reducing the intra-class feature scattering that exists in CSL models. In this paper, we investigate whether applying contrastive learning to fine-tuning would bring further benefits, and analytically find that optimizing the contrastive loss benefits both discriminative representation learning and model optimization during fine-tuning. Inspired by these findings, we propose Contrast-regularized tuning (Core-tuning), a new approach for fine-tuning CSL models. Instead of simply adding the contrastive loss to the fine-tuning objective, Core-tuning further applies a novel hard pair mining strategy for more effective contrastive fine-tuning and smooths the decision boundary to better exploit the learned discriminative feature space. Extensive experiments on image classification and semantic segmentation verify the effectiveness of Core-tuning.
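As a rough illustration of the baseline that the abstract contrasts Core-tuning against, the sketch below adds a supervised contrastive term to cross-entropy. It is not the authors' implementation: it omits the hard pair mining and decision-boundary smoothing that Core-tuning adds, and the function names, the `weight` and `temperature` defaults, and the particular form of the contrastive term are all assumptions made for illustration.

```python
import math


def l2_normalize(v):
    """Scale a feature vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy(logits, label):
    """Cross-entropy of one sample, computed with a stable log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[label]


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Pull same-class features together and push other classes apart.

    Hypothetical helper: one common form of a supervised contrastive loss,
    averaged over anchors that have at least one same-class sample.
    """
    feats = [l2_normalize(f) for f in features]
    n = len(feats)
    total, anchors = 0.0, 0
    for i in range(n):
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # no same-class pair for this anchor
        # Temperature-scaled cosine similarity to every other sample.
        sims = {
            j: sum(a * b for a, b in zip(feats[i], feats[j])) / temperature
            for j in range(n)
            if j != i
        }
        m = max(sims.values())
        log_denom = m + math.log(sum(math.exp(s - m) for s in sims.values()))
        total += -sum(sims[j] - log_denom for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0


def contrast_regularized_loss(logits, features, labels, weight=0.1, temperature=0.1):
    """Mean cross-entropy plus a weighted contrastive regularizer (assumed form)."""
    ce = sum(cross_entropy(l, y) for l, y in zip(logits, labels)) / len(labels)
    con = supervised_contrastive_loss(features, labels, temperature)
    return ce + weight * con
```

With this form, a batch whose same-class features sit close together yields a smaller contrastive term than one whose same-class features are scattered, which is the intra-class effect the abstract says cross-entropy alone handles poorly.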
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | PASCAL VOC 2012 (val) | Mean IoU 79.62 | 2040 |
| Image Classification | CIFAR-100 | Top-1 Accuracy 84.13 | 622 |
| Image Classification | DTD | Accuracy 75.37 | 487 |
| Image Classification | CIFAR-10 | -- | 471 |
| Image Classification | ImageNet | Top-1 Accuracy 77.43 | 429 |
| Image Classification | Aircraft | Accuracy 89.48 | 302 |
| Image Classification | iNaturalist 2018 | Top-1 Accuracy 63.57 | 287 |
| Image Classification | Oxford-IIIT Pets | Accuracy 92.36 | 259 |
| Image Classification | PACS (test) | Average Accuracy 88.08 | 254 |
| Semantic segmentation | Pascal VOC (test) | mIoU 79.62 | 236 |