Multi-task Self-Supervised Visual Learning
About
We investigate methods for combining multiple self-supervised tasks (i.e., supervised tasks where data can be collected without manual labeling) in order to train a single visual representation. First, we provide an apples-to-apples comparison of four different self-supervised tasks using the very deep ResNet-101 architecture. We then combine tasks to jointly train a network. We also explore lasso regularization to encourage the network to factorize the information in its representation, and methods for "harmonizing" network inputs in order to learn a more unified representation. We evaluate all methods on ImageNet classification, PASCAL VOC detection, and NYU depth prediction. Our results show that deeper networks work better, and that combining tasks, even via a naive multi-head architecture, always improves performance. Our best joint network nearly matches the PASCAL performance of a model pre-trained on ImageNet classification, and matches the ImageNet network on NYU depth prediction.
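The abstract describes two of the key ideas: a multi-head architecture in which several self-supervised tasks share one trunk, and a lasso (L1) penalty that encourages each task head to draw on only a few layers of that trunk, factorizing the representation. The sketch below illustrates the general pattern with a toy NumPy model; the dimensions, task names, and the simple fully connected trunk are illustrative assumptions, not the paper's actual ResNet-101 setup.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes for illustration only (not from the paper).
FEATURE_DIM = 64   # width of each trunk block's output
NUM_BLOCKS = 4     # trunk blocks whose outputs a head may draw from

# Stand-in for a deep shared trunk (the paper uses ResNet-101):
# a stack of ReLU blocks whose intermediate outputs are all exposed.
trunk_weights = [rng.standard_normal((FEATURE_DIM, FEATURE_DIM)) * 0.1
                 for _ in range(NUM_BLOCKS)]

def shared_trunk(x):
    """Return one feature vector per block so heads can select among them."""
    feats = []
    h = x
    for W in trunk_weights:
        h = np.maximum(h @ W, 0.0)  # ReLU block
        feats.append(h)
    return feats  # list of NUM_BLOCKS arrays, each (FEATURE_DIM,)

class TaskHead:
    """One self-supervised task's head.

    Each head mixes the trunk's block outputs with its own weight vector
    `alpha`; an L1 (lasso) penalty on `alpha` encourages the head to rely
    on few blocks, so tasks factorize the shared representation.
    """
    def __init__(self, out_dim):
        self.alpha = np.ones(NUM_BLOCKS) / NUM_BLOCKS  # block-mixing weights
        self.W = rng.standard_normal((FEATURE_DIM, out_dim)) * 0.1

    def forward(self, block_feats):
        mixed = sum(a * f for a, f in zip(self.alpha, block_feats))
        return mixed @ self.W

    def lasso_penalty(self, lam=1e-2):
        return lam * np.abs(self.alpha).sum()

# Two hypothetical self-supervised tasks sharing the trunk.
heads = {"rotation": TaskHead(4), "colorization": TaskHead(16)}

x = rng.standard_normal(FEATURE_DIM)
feats = shared_trunk(x)
outputs = {name: head.forward(feats) for name, head in heads.items()}
total_penalty = sum(head.lasso_penalty() for head in heads.values())
```

In training, each task's loss would be computed on its head's output and summed with `total_penalty`, so gradients through the shared trunk come from all tasks at once, the "naive multi-head" combination the abstract says always helps.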
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 32.7 | 2454 |
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 31.5 | 1453 |
| Image Classification | ImageNet (val) | -- | -- | 1206 |
| Object Detection | PASCAL VOC 2007 (test) | mAP | 70.5 | 821 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 79.3 | 423 |
| Image Classification | ImageNet (val) | -- | -- | 354 |
| Image Classification | ImageNet | -- | -- | 55 |
| Image Classification | VTAB v2 (test) | Mean Accuracy | 59.2 | 39 |
| Depth Prediction | NYU Depth | -- | -- | 5 |