Video Representation Learning by Dense Predictive Coding
About
The objective of this paper is self-supervised learning of spatio-temporal embeddings from video, suitable for human action recognition. We make three contributions. First, we introduce the Dense Predictive Coding (DPC) framework for self-supervised representation learning on videos, which learns a dense encoding of spatio-temporal blocks by recurrently predicting future representations. Second, we propose a curriculum training scheme that predicts further into the future with progressively less temporal context; this encourages the model to encode only slowly varying spatio-temporal signals, leading to semantic representations. Third, we evaluate the approach by first training the DPC model on the Kinetics-400 dataset with self-supervised learning, and then finetuning the representation on a downstream task, i.e. action recognition. With a single stream (RGB only), DPC-pretrained representations achieve state-of-the-art self-supervised performance on both UCF101 (75.7% top-1 accuracy) and HMDB51 (35.7% top-1 accuracy), outperforming all previous learning methods by a significant margin and approaching the performance of a baseline pretrained on ImageNet.
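The core idea, predicting the embedding of a future spatio-temporal block from an aggregate of past block embeddings and scoring it against negatives with a contrastive (InfoNCE-style) loss, can be illustrated with a minimal NumPy sketch. Everything here (`encode_block`, `aggregate`, `W_enc`, `W_pred`, the random data) is a hypothetical stand-in, not the paper's actual 3D-CNN encoder or recurrent aggregator:

```python
import numpy as np

rng = np.random.default_rng(0)
dim, block_len = 8, 4

# Hypothetical stand-in weights (the paper uses a 3D-CNN encoder
# and a recurrent aggregator; here both are toy linear maps).
W_enc = rng.standard_normal((dim, dim))
W_pred = rng.standard_normal((dim, dim))

def encode_block(block):
    # Toy block encoder: pool the frames, apply a linear map + ReLU.
    return np.maximum(W_enc @ block.mean(axis=0), 0.0)

def aggregate(embeddings):
    # Toy aggregator: mean over past block embeddings
    # (a stand-in for the recurrent context aggregation).
    return np.mean(embeddings, axis=0)

def dpc_contrastive_loss(context, future_true, negatives):
    # Predict the future embedding from the aggregated context, then
    # score it against the true future and negatives (InfoNCE-style):
    # the positive pair should get the highest similarity.
    pred = W_pred @ context
    logits = np.array([pred @ future_true] + [pred @ n for n in negatives])
    logits -= logits.max()  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum())
    return -log_softmax[0]  # positive is at index 0

# Toy usage: 3 past blocks, 1 true future block, 5 negative blocks.
past = [encode_block(rng.standard_normal((block_len, dim))) for _ in range(3)]
context = aggregate(past)
future_true = encode_block(rng.standard_normal((block_len, dim)))
negatives = [encode_block(rng.standard_normal((block_len, dim))) for _ in range(5)]
loss = dpc_contrastive_loss(context, future_true, negatives)
```

Minimizing this loss pushes the predicted future embedding toward the true future block and away from the negatives, which is the mechanism the curriculum scheme then makes harder by lengthening the prediction horizon.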
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 | Accuracy | 75.7% | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 75.7% | 357 |
| Action Recognition | UCF101 (test) | Accuracy | 75.7% | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 35.7% | 249 |
| Action Recognition | HMDB51 | Top-1 Accuracy | 35.7% | 225 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Accuracy | 35.7% | 204 |
| Action Recognition | HMDB51 | 3-Fold Accuracy | 35.7% | 191 |
| Video Action Recognition | UCF101 | Top-1 Accuracy | 68.2% | 153 |
| Action Recognition | UCF-101 | Top-1 Accuracy | 75.7% | 147 |
| Action Classification | HMDB51 (over all three splits) | Accuracy | 34.5% | 121 |