Unsupervised Visual Representation Learning by Synchronous Momentum Grouping

About

In this paper, we propose a genuine group-level contrastive visual representation learning method whose linear evaluation performance on ImageNet surpasses that of vanilla supervised learning. The two mainstream unsupervised learning schemes are the instance-level contrastive framework and clustering-based methods. The former relies on extremely fine-grained instance-level discrimination, whose supervisory signal is inefficient because of false negatives. The latter avoids this problem but commonly comes with restrictions that hurt performance. To combine the advantages of both, we design SMoG (Synchronous Momentum Grouping). SMoG follows the contrastive learning framework but changes the contrastive unit from instances to groups, mimicking clustering-based methods. To achieve this, we propose a momentum grouping scheme that conducts feature grouping synchronously with representation learning. In this way, SMoG solves the supervisory-signal hysteresis that clustering-based methods usually face and reduces the false negatives of instance-level contrastive methods. Extensive experiments show that SMoG works well on both CNN and Transformer backbones and surpasses the current state-of-the-art (SOTA) unsupervised representation learning methods. Moreover, its linear evaluation results surpass those obtained by vanilla supervised learning, and the learned representation transfers well to downstream tasks.

Bo Pang, Yifan Zhang, Yaoyi Li, Jia Cai, Cewu Lu · 2022
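The abstract describes momentum grouping only at a high level. As a rough illustration, here is a minimal Python sketch of one group-level contrastive step. It assumes that (1) each L2-normalized feature is assigned to its most similar group feature by cosine similarity, (2) the target group for one augmented view comes from the assignment of the other view, (3) group features are updated by an exponential moving average toward the mean of their assigned features and then re-normalized, and (4) the loss is a temperature-scaled softmax cross-entropy over feature-to-group similarities. None of these details come from the text above, and the names `assign_groups`, `momentum_update`, `group_contrastive_loss`, `smog_style_step`, `beta`, and `tau` are hypothetical. This is not the authors' implementation. There is no encoder or backpropagation here; features are fixed vectors.

```python
import math

# Hypothetical helpers for illustration only; not the SMoG authors' code.


def normalize(v):
    """Scale a vector to unit L2 norm (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def assign_groups(features, groups):
    """Assumption: each unit-norm feature joins the group with the highest cosine similarity."""
    return [max(range(len(groups)), key=lambda k: dot(f, groups[k])) for f in features]


def momentum_update(groups, features, assignments, beta=0.99):
    """Assumption: EMA update of each group toward the mean of its assigned features.

    Groups that receive no features in this batch are kept as-is.
    """
    dim = len(groups[0])
    new_groups = []
    for k, g in enumerate(groups):
        members = [f for f, a in zip(features, assignments) if a == k]
        if not members:
            new_groups.append(g)
            continue
        mean = [sum(m[d] for m in members) / len(members) for d in range(dim)]
        new_groups.append(normalize([beta * g[d] + (1 - beta) * mean[d] for d in range(dim)]))
    return new_groups


def group_contrastive_loss(features, targets, groups, tau=0.1):
    """Assumption: softmax cross-entropy over feature-to-group similarities divided by tau.

    Each feature is pulled toward its target group and pushed away from the other groups.
    """
    total = 0.0
    for f, t in zip(features, targets):
        logits = [dot(f, g) / tau for g in groups]
        m = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(z - m) for z in logits))
        total += log_z - logits[t]
    return total / len(features)


def smog_style_step(view_a, view_b, groups, beta=0.99, tau=0.1):
    """One forward step: targets come from view_b, the loss is computed on view_a,
    and the group features are updated in the same step (the "synchronous" part)."""
    targets = assign_groups(view_b, groups)
    loss = group_contrastive_loss(view_a, targets, groups, tau)
    groups = momentum_update(groups, view_b, targets, beta)
    return loss, groups


# Usage: two groups in 2-D and two slightly different "views" of two samples.
groups = [normalize([1.0, 0.0]), normalize([0.0, 1.0])]
view_a = [normalize([0.9, 0.1]), normalize([0.2, 0.8])]
view_b = [normalize([0.8, 0.2]), normalize([0.1, 0.9])]
loss, groups = smog_style_step(view_a, view_b, groups)
print(f"loss={loss:.4f}")
```

Because the group update runs in the same step that computes the loss, the group features track the current representation instead of being recomputed in a separate offline clustering pass; this is the hysteresis the abstract attributes to clustering-based methods. Contrasting each sample against a few group features rather than against every other instance is how the abstract frames the reduction in false negatives.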

Related benchmarks

Task | Dataset | Metric | Result | Rank
Image Classification | ImageNet (val) | Top-1 Acc | 77.7 | 1206
Semantic Segmentation | Cityscapes | mIoU | 76.03 | 578
Object Detection | COCO standard 2017 (train val) | AP (IoU 0.5:0.95) | 40.1 | 64
Image Classification | ImageNet 1.0 (10% labeled) | -- | -- | 33
Image Classification | ImageNet-1K 1.0 (1% labels) | Top-1 Acc | 63.6 | 28
Instance Segmentation | COCO 2017 (train/val) | AP (Mask) | 36.9 | 21
Semantic Segmentation | VOC 2012 | mIoU | 76.22 | 18
Image Classification | ImageNet 100% labels 1.0 | Top-1 Acc | 80.2 | 7
