
Video Representation Learning by Recognizing Temporal Transformations

About

We introduce a novel self-supervised learning approach that learns video representations responsive to changes in motion dynamics. Our representations can be learned from data without human annotation and provide a substantial boost when training neural networks on small labeled data sets for tasks such as action recognition, which require accurately distinguishing the motion of objects. We promote accurate learning of motion without human annotation by training a neural network to discriminate a video sequence from its temporally transformed versions. To learn to distinguish non-trivial motions, the design of the transformations is based on two principles: 1) to define clusters of motions based on time warps of different magnitude; 2) to ensure that the discrimination is feasible only by observing and analyzing as many image frames as possible. Thus, we introduce the following transformations: forward-backward playback, random frame skipping, and uniform frame skipping. Our experiments show that networks trained with the proposed method yield representations with improved transfer performance for action recognition on UCF101 and HMDB51.
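The transformations named above can be expressed purely as operations on frame indices. The sketch below is a hypothetical helper (not the authors' code); the clip length, skip factor, and sampling details are assumptions made for illustration.

```python
import random

def temporal_transforms(num_frames, clip_len=16, seed=None):
    """Sketch of the temporal transformations described above.

    Given a video with `num_frames` frames, return frame-index sequences
    for the original clip and its transformed versions, which a network
    would be trained to discriminate. All parameters are assumptions.
    """
    rng = random.Random(seed)

    # Original clip: contiguous frames from a random start position.
    start = rng.randrange(0, num_frames - clip_len + 1)
    original = list(range(start, start + clip_len))

    # Forward-backward playback: play the first half forward,
    # then the same frames in reverse (a palindromic sequence).
    half = original[: clip_len // 2]
    forward_backward = half + half[::-1]

    # Uniform frame skipping: every k-th frame (constant speed-up).
    k = 2  # assumed skip factor
    u_start = rng.randrange(0, num_frames - clip_len * k + 1)
    uniform_skip = list(range(u_start, u_start + clip_len * k, k))

    # Random frame skipping: strictly increasing indices with random gaps.
    random_skip = sorted(rng.sample(range(num_frames), clip_len))

    return {
        "original": original,
        "forward_backward": forward_backward,
        "uniform_skip": uniform_skip,
        "random_skip": random_skip,
    }
```

Each transform preserves the clip length, so the network cannot tell them apart from sequence length alone; it must analyze the motion pattern across frames, which is the point of the second design principle.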

Simon Jenni, Givi Meishvili, Paolo Favaro • 2020

Related benchmarks

| Task                     | Dataset                        | Metric         | Result | Rank |
|--------------------------|--------------------------------|----------------|--------|------|
| Action Recognition       | UCF101 (mean of 3 splits)      | Accuracy       | 81.6   | 357  |
| Action Recognition       | UCF101 (test)                  | Accuracy       | 81.6   | 307  |
| Action Recognition       | HMDB51 (test)                  | Accuracy       | 46.4   | 249  |
| Action Recognition       | HMDB51                         | Top-1 Acc      | 49.8   | 225  |
| Video Action Recognition | UCF101                         | Top-1 Acc      | 81.6   | 153  |
| Action Recognition       | UCF-101                        | Top-1 Acc      | 81.6   | 147  |
| Action Classification    | HMDB51 (over all three splits) | Accuracy       | 49.8   | 121  |
| Video Action Recognition | HMDB51                         | Top-1 Accuracy | 47.5   | 103  |
| Video Retrieval          | UCF101 (1)                     | Top-1 Acc      | 26.1   | 92   |
| Video Recognition        | HMDB51                         | Accuracy       | 49.8   | 89   |

Showing 10 of 17 rows
