Memory-augmented Dense Predictive Coding for Video Representation Learning

About

The objective of this paper is self-supervised learning from video, in particular learning representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task. It is trained with a predictive attention mechanism over a set of compressed memories, such that any future state can always be constructed as a convex combination of the condensed representations, allowing the model to make multiple hypotheses efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both. (iii) We thoroughly evaluate the quality of the learnt representation on four different downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance relative to other approaches, with orders of magnitude less training data.
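The convex-combination idea in contribution (i) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `predict_future_state`, the linear addressing matrix `addressing_weights`, and all dimensions are assumptions chosen for illustration. It only shows that softmax attention over a memory bank yields non-negative weights that sum to one, so the predicted state always lies in the convex hull of the memory slots.

```python
import numpy as np


def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)


def predict_future_state(context, memory_bank, addressing_weights):
    """Hypothetical helper: predict a future feature from a context vector.

    context:            (context_dim,) summary of the observed past
    memory_bank:        (num_slots, feature_dim) compressed memory entries
    addressing_weights: (context_dim, num_slots) assumed linear addressing layer
    """
    logits = context @ addressing_weights  # one score per memory slot
    probs = softmax(logits)                # non-negative, sums to 1
    future = probs @ memory_bank           # convex combination of the slots
    return future, probs


# Toy sizes and random values, assumed purely for illustration.
rng = np.random.default_rng(0)
context_dim, num_slots, feature_dim = 8, 16, 4
context = rng.normal(size=context_dim)
addressing_weights = rng.normal(size=(context_dim, num_slots))
memory_bank = rng.normal(size=(num_slots, feature_dim))

future, probs = predict_future_state(context, memory_bank, addressing_weights)
```

Because the attention weights form a probability distribution over slots, different contexts can select different mixtures of the same memory entries, which is one way to read the abstract's claim about forming multiple hypotheses efficiently.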

Tengda Han, Weidi Xie, Andrew Zisserman • 2020

Related benchmarks

Task	Dataset	Metric	Result	
Action Recognition	UCF101	Accuracy	86.1	433
Action Recognition	UCF101 (test)	Accuracy	54.1	376
Action Recognition	UCF101 (mean of 3 splits)	Accuracy	69.2	357
Action Recognition	HMDB51 (test)	Accuracy	0.412	249
Action Recognition	UCF-101	Top-1 Acc	86.1	225
Action Recognition	HMDB51	Top-1 Acc	54.5	225
Action Recognition	HMDB51	3-Fold Accuracy	41.2	191
Video Action Recognition	UCF101	Top-1 Acc	84.3	165
Video Recognition	HMDB51	Accuracy	54.5	145
Video Recognition	UCF101	Accuracy	86.1	111

Showing 10 of 26 rows
