
Learning Features by Watching Objects Move

About

This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.
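The pipeline the abstract describes can be illustrated with a deliberately crude stand-in for the motion-segmentation step: threshold per-pixel differences between consecutive frames to obtain binary "pseudo ground truth" masks. The paper itself relies on a far more robust unsupervised video segmentation method; the frame-differencing below, the function name, and the threshold are illustrative assumptions only.

```python
import numpy as np

def pseudo_masks_from_motion(frames, threshold=10.0):
    """Toy motion-based grouping (an assumption, not the paper's method):
    pixels whose intensity changes between consecutive frames are marked
    as 'object', yielding one binary pseudo-mask per frame pair. These
    masks would then serve as segmentation targets for a convnet that
    sees only a single frame."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(np.diff(frames, axis=0))      # per-pixel motion magnitude
    return (diffs > threshold).astype(np.uint8)  # binary pseudo ground truth

# Toy example: a bright 2x2 "object" shifts one pixel to the right.
f0 = np.zeros((8, 8)); f0[3:5, 2:4] = 255.0
f1 = np.zeros((8, 8)); f1[3:5, 3:5] = 255.0
masks = pseudo_masks_from_motion([f0, f1])
# masks[0] is nonzero only where the object appeared or disappeared
```

In the actual approach, such masks (produced by a proper motion segmenter) are noisy per-instance labels, and the convnet trained to predict them from static frames is what yields the transferable representation.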

Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan • 2016

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 32.3 | 2454 |
| Image Classification | ImageNet (val) | Top-1 Acc | 27.62 | 1206 |
| Object Detection | PASCAL VOC 2007 (test) | mAP | 61.13 | 821 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 74.2 | 423 |
| Image Classification | ImageNet (val) | Top-1 Accuracy | 27.6 | 354 |
| Classification | PASCAL VOC 2007 (test) | mAP (%) | 61 | 217 |
| Object Detection | PASCAL VOC 2007 | mAP | 52.2 | 49 |
| Perceptual Similarity | BAPPS (val) | 2AFC (Overall) | 67.2 | 39 |
| Image Classification | VTAB v2 (test) | Mean Accuracy | 47.1 | 39 |
| Video Object Segmentation | DAVIS (val) | -- | -- | 28 |
Showing 10 of 13 rows
