Unsupervised Object-Level Representation Learning from Scene Images

About

Contrastive self-supervised learning has largely narrowed the gap to supervised pre-training on ImageNet. However, its success relies heavily on the object-centric prior of ImageNet, i.e., that different augmented views of the same image correspond to the same object. This heavily curated constraint breaks down when pre-training on more complex scene images that contain many objects. To overcome this limitation, we introduce Object-level Representation Learning (ORL), a new self-supervised learning framework for scene images. Our key insight is to leverage image-level self-supervised pre-training as the prior to discover object-level semantic correspondence, thus realizing object-level representation learning from scene images. Extensive experiments on COCO show that ORL significantly improves the performance of self-supervised learning on scene images, even surpassing supervised ImageNet pre-training on several downstream tasks. Furthermore, ORL improves downstream performance when more unlabeled scene images are available, demonstrating its great potential for harnessing unlabeled data in the wild. We hope our approach can motivate future research on more general-purpose unsupervised representation learning from scene data.
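The abstract does not spell out how object-level correspondence is discovered, so the following is only a minimal sketch of one plausible reading: given per-region feature vectors for two images (assumed here to come from an image-level pre-trained encoder), rank cross-image region pairs by cosine similarity and keep the top few as candidate object-level pairs. The function names, the use of cosine similarity, and the top-k cutoff are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row vector to unit length (eps guards against zero vectors)."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def top_region_pairs(feats_a, feats_b, k=2):
    """Return the k most similar cross-image region pairs as (i, j, score).

    feats_a: (Na, D) region features from image A (hypothetical input)
    feats_b: (Nb, D) region features from image B (hypothetical input)
    """
    # Cosine similarity between every region in A and every region in B.
    sim = l2_normalize(feats_a) @ l2_normalize(feats_b).T
    # Indices of the k largest entries of the flattened similarity matrix.
    flat = np.argsort(sim, axis=None)[::-1][:k]
    rows, cols = np.unravel_index(flat, sim.shape)
    return [(int(i), int(j), float(sim[i, j])) for i, j in zip(rows, cols)]


# Toy features: 3 regions in image A, 2 regions in image B, 2-D embeddings.
feats_a = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
feats_b = np.array([[0.0, 2.0], [3.0, 0.1]])
pairs = top_region_pairs(feats_a, feats_b, k=2)
```

In this toy case the two retained pairs link region 1 of image A to region 0 of image B, and region 0 of A to region 1 of B. Such pairs could then serve as positives for an object-level contrastive objective, again under the assumptions stated above.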

Jiahao Xie, Xiaohang Zhan, Ziwei Liu, Yew Soon Ong, Chen Change Loy • 2021

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 36.7	3089
Object Detection	COCO 2017 (val)	--	2930
Semantic segmentation	PASCAL VOC 2012 (val)	Mean IoU 70.9	2210
Image Classification	ImageNet-1k (val)	Top-1 Accuracy 60.7	1498
Instance Segmentation	COCO 2017 (val)	APm 0.363	1304
Object Detection	COCO v2017 (test-dev)	mAP 40.3	499
Image Classification	ImageNet (val)	Top-1 Accuracy 60.9	354
Instance Segmentation	COCO	APmask 36.7	301
Object Detection	COCO	AP50 (Box) 60.8	237
Semantic segmentation	COCO Stuff (val)	mIoU 45.6	173

