
Learning Features by Watching Objects Move

About

This paper presents a novel yet intuitive approach to unsupervised feature learning. Inspired by the human visual system, we explore whether low-level motion-based grouping cues can be used to learn an effective visual representation. Specifically, we use unsupervised motion-based segmentation on videos to obtain segments, which we use as 'pseudo ground truth' to train a convolutional network to segment objects from a single frame. Given the extensive evidence that motion plays a key role in the development of the human visual system, we hope that this straightforward approach to unsupervised learning will be more effective than cleverly designed 'pretext' tasks studied in the literature. Indeed, our extensive experiments show that this is the case. When used for transfer learning on object detection, our representation significantly outperforms previous unsupervised approaches across multiple settings, especially when training data for the target task is scarce.
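The pipeline the abstract describes can be illustrated with a deliberately crude stand-in for the motion-segmentation step: threshold per-pixel differences between consecutive frames to obtain binary "pseudo ground truth" masks. The paper itself relies on a far more robust unsupervised video segmentation method; the frame-differencing below, the function name, and the threshold are illustrative assumptions only.

```python
import numpy as np

def pseudo_masks_from_motion(frames, threshold=10.0):
    """Toy motion-based grouping (an assumption, not the paper's method):
    pixels whose intensity changes between consecutive frames are marked
    as 'object', yielding one binary pseudo-mask per frame pair. These
    masks would then serve as segmentation targets for a convnet that
    sees only a single frame."""
    frames = np.asarray(frames, dtype=np.float32)
    diffs = np.abs(np.diff(frames, axis=0))      # per-pixel motion magnitude
    return (diffs > threshold).astype(np.uint8)  # binary pseudo ground truth

# Toy example: a bright 2x2 "object" shifts one pixel to the right.
f0 = np.zeros((8, 8)); f0[3:5, 2:4] = 255.0
f1 = np.zeros((8, 8)); f1[3:5, 3:5] = 255.0
masks = pseudo_masks_from_motion([f0, f1])
# masks[0] is nonzero only where the object appeared or disappeared
```

In the actual approach, such masks (produced by a proper motion segmenter) are noisy per-instance labels, and the convnet trained to predict them from static frames is what yields the transferable representation.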

Deepak Pathak, Ross Girshick, Piotr Dollár, Trevor Darrell, Bharath Hariharan • 2016

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Object Detection | COCO 2017 (val) | AP | 32.3 | 2454 |
| Image Classification | ImageNet (val) | Top-1 Acc | 27.62 | 1206 |
| Object Detection | PASCAL VOC 2007 (test) | mAP | 61.13 | 821 |
| Depth Estimation | NYU v2 (test) | Threshold Accuracy (delta < 1.25) | 74.2 | 423 |
| Image Classification | ImageNet (val) | Top-1 Accuracy | 27.6 | 354 |
| Classification | PASCAL VOC 2007 (test) | mAP (%) | 61 | 217 |
| Object Detection | PASCAL VOC 2007 | mAP | 52.2 | 49 |
| Perceptual Similarity | BAPPS (val) | 2AFC (Overall) | 67.2 | 39 |
| Image Classification | VTAB v2 (test) | Mean Accuracy | 47.1 | 39 |
| Video Object Segmentation | DAVIS (val) | -- | -- | 28 |
Showing 10 of 13 rows
