
Unsupervised Learning of Visual Representations using Videos

About

Is strong supervision necessary for learning a good visual representation? Do we really need millions of semantically labeled images to train a Convolutional Neural Network (CNN)? In this paper, we present a simple yet surprisingly powerful approach to unsupervised learning of CNNs. Specifically, we use hundreds of thousands of unlabeled videos from the web to learn visual representations. Our key idea is that visual tracking provides the supervision: two patches connected by a track should have similar representations in deep feature space, since they probably belong to the same object or object part. We design a Siamese-triplet network with a ranking loss function to train this CNN representation. Without using a single image from ImageNet, using just 100K unlabeled videos and the VOC 2012 dataset, we train an ensemble of unsupervised networks that achieves 52% mAP (no bounding-box regression). This performance comes tantalizingly close to its ImageNet-supervised counterpart, an ensemble that achieves 54.4% mAP. We also show that our unsupervised network performs competitively on other tasks such as surface-normal estimation.
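The ranking loss described above can be sketched as a hinge over distances: the patch tracked from the anchor (the positive) should be closer to the anchor in feature space than a random patch (the negative), by at least some margin. A minimal NumPy sketch, assuming cosine distance between feature vectors; the function names and the margin value are illustrative, not the paper's exact implementation:

```python
import numpy as np

def cosine_distance(a, b):
    """Distance in [0, 2]: one minus the cosine similarity of two feature vectors."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

def triplet_ranking_loss(anchor, positive, negative, margin=0.5):
    """Hinge (ranking) loss: penalize triplets where the tracked patch is not
    closer to the anchor than the random patch by at least `margin`."""
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy features: the positive lies near the anchor, the negative does not.
anchor   = np.array([1.0, 0.0, 0.0])
positive = np.array([0.9, 0.1, 0.0])
negative = np.array([0.0, 1.0, 0.0])

loss = triplet_ranking_loss(anchor, positive, negative)  # satisfied triplet: 0.0
```

In training, the three branches of the Siamese-triplet network share weights, so minimizing this loss pulls tracked patch pairs together and pushes random pairs apart in the learned feature space.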

Xiaolong Wang, Abhinav Gupta · 2015

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1k (val) | Top-1 Accuracy | 38.8 | 1453 |
| Object Detection | PASCAL VOC 2007 (test) | mAP | 60.2 | 821 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 42.7 | 357 |
| Human Pose Estimation | MPII (test) | – | – | 314 |
| Action Recognition | UCF101 (test) | Accuracy | 41.5 | 307 |
| Classification | PASCAL VOC 2007 (test) | mAP (%) | 63.1 | 217 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Accuracy | 15.6 | 204 |
| Semantic Segmentation | PASCAL VOC | mIoU | 0.354 | 172 |
| Action Recognition | UCF101 (Split 1) | – | – | 105 |
| Object Detection | PASCAL VOC | mAP | 47.2 | 88 |

Showing 10 of 20 rows.
