
Shuffle and Learn: Unsupervised Learning using Temporal Order Verification

About

In this paper, we present an approach for learning a visual representation from the raw spatiotemporal signals in videos. Our representation is learned without supervision from semantic labels. We formulate our method as an unsupervised sequential verification task, i.e., we determine whether a sequence of frames from a video is in the correct temporal order. With this simple task and no semantic labels, we learn a powerful visual representation using a Convolutional Neural Network (CNN). The representation contains complementary information to that learned from supervised image datasets like ImageNet. Qualitative results show that our method captures information that is temporally varying, such as human pose. When used as pre-training for action recognition, our method gives significant gains over learning without external data on benchmark datasets like UCF101 and HMDB51. To demonstrate its sensitivity to human pose, we show results for pose estimation on the FLIC and MPII datasets that are competitive with, or better than, approaches using significantly more supervision. Our method can be combined with supervised representations to provide an additional boost in accuracy.
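The verification task described above can be illustrated with a minimal sketch of the data-sampling step: draw three frame indices from a short temporal window to form a correctly ordered (positive) tuple, and permute them to form an out-of-order (negative) tuple. The function below is a simplified, hypothetical illustration; the paper's actual sampling additionally uses motion statistics to pick informative frames.

```python
import random

def make_tuples(num_frames, window=30, seed=0):
    """Sample one positive (ordered) and one negative (shuffled)
    frame-index triple from a clip of `num_frames` frames.

    A binary classifier trained to distinguish the two learns a
    representation sensitive to temporal order (e.g., human pose).
    """
    rng = random.Random(seed)
    # Pick three distinct, increasing frame indices inside a temporal window
    start = rng.randrange(num_frames - window)
    a, b, c = sorted(rng.sample(range(start, start + window), 3))
    positive = (a, b, c)   # correct temporal order
    negative = (b, a, c)   # permuted -> wrong temporal order
    return positive, negative
```

In training, frames at the positive and negative index tuples would be fed to a shared CNN whose outputs are fused for the binary order/shuffled decision.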

Ishan Misra, C. Lawrence Zitnick, Martial Hebert • 2016

Related benchmarks

Task | Dataset | Metric | Result | Rank
Object Detection | PASCAL VOC 2007 (test) | mAP | 39.9 | 821
Action Recognition | NTU RGB+D 60 (Cross-View) | Accuracy | 40.9 | 575
Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 68.7 | 357
Human Pose Estimation | MPII (test) | -- | -- | 314
Action Recognition | UCF101 (test) | Accuracy | 50.2 | 307
Action Recognition | NTU RGB-D Cross-Subject 60 | Accuracy | 46.2 | 305
Action Recognition | HMDB51 (test) | Accuracy | 0.181 | 249
Action Recognition | HMDB51 | Top-1 Acc | 19.8 | 225
Classification | PASCAL VOC 2007 (test) | mAP (%) | 54.3 | 217
Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 35.8 | 204

Showing 10 of 55 rows

Other info

Code
