
Video Representation Learning by Dense Predictive Coding

About

The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions: first, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos, which learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations; second, we propose a curriculum training scheme that predicts further into the future with progressively less temporal context, which encourages the model to encode only slowly varying spatio-temporal signals and therefore leads to semantic representations; third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e., action recognition. With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous self-supervised learning methods by a significant margin and approaching the performance of a baseline pre-trained on ImageNet.
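The core of DPC is a contrastive prediction objective: an embedding predicted for a future spatio-temporal block should score higher against the true future block than against negatives drawn from other blocks. The sketch below is a minimal, hypothetical illustration of such an NCE-style loss using plain dot-product similarity; the function name `dpc_nce_loss` and the toy embeddings are assumptions for illustration, not the authors' implementation (which uses a 3D-ResNet encoder, a ConvGRU aggregator, and negatives drawn across space, time, and the batch).

```python
import math

def dpc_nce_loss(pred, candidates, pos_idx):
    """NCE-style contrastive loss: the predicted future embedding should
    score highest against the true future block among all candidates."""
    # Dot-product similarity between the prediction and each candidate block.
    scores = [sum(p * c for p, c in zip(pred, cand)) for cand in candidates]
    m = max(scores)  # subtract the max for numerical stability in exp()
    exps = [math.exp(s - m) for s in scores]
    # Cross-entropy with the true future block as the positive class.
    return -math.log(exps[pos_idx] / sum(exps))

# Toy example: the prediction aligns with candidate 0 (the true future),
# so treating index 0 as the positive gives a small loss, while treating
# a negative as the positive gives a large loss.
pred = [1.0, 0.0, 0.0]
candidates = [
    [5.0, 0.0, 0.0],  # true future embedding (positive)
    [0.0, 5.0, 0.0],  # negatives: embeddings of other space-time blocks
    [0.0, 0.0, 5.0],
]
good = dpc_nce_loss(pred, candidates, pos_idx=0)
bad = dpc_nce_loss(pred, candidates, pos_idx=1)
```

Minimizing this loss pushes the predicted embedding toward the true future block and away from the negatives, which is what forces the encoder to capture the slowly varying, semantic content of the video rather than low-level appearance.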

Tengda Han, Weidi Xie, Andrew Zisserman • 2019

Related benchmarks

Task                     | Dataset                             | Metric          | Result | Rank
Action Recognition       | UCF101                              | Accuracy        | 75.7   | 365
Action Recognition       | UCF101 (mean of 3 splits)           | Accuracy        | 75.7   | 357
Action Recognition       | UCF101 (test)                       | Accuracy        | 75.7   | 307
Action Recognition       | HMDB51 (test)                       | Accuracy        | 35.7   | 249
Action Recognition       | HMDB51                              | Top-1 Acc       | 35.7   | 225
Action Recognition       | HMDB-51 (average of three splits)   | Top-1 Acc       | 35.7   | 204
Action Recognition       | HMDB51                              | 3-Fold Accuracy | 35.7   | 191
Video Action Recognition | UCF101                              | Top-1 Acc       | 68.2   | 153
Action Recognition       | UCF-101                             | Top-1 Acc       | 75.7   | 147
Action Classification    | HMDB51 (over all three splits)      | Accuracy        | 34.5   | 121

(10 of 26 rows shown)

Other info

Code
