Memory-augmented Dense Predictive Coding for Video Representation Learning
About
The objective of this paper is self-supervised learning from video, in particular of representations for action recognition. We make the following contributions: (i) We propose a new architecture and learning framework, Memory-augmented Dense Predictive Coding (MemDPC), for the task. It is trained with a predictive attention mechanism over a set of compressed memories, such that any future state can always be constructed as a convex combination of the condensed representations, which allows multiple hypotheses to be made efficiently. (ii) We investigate visual-only self-supervised video representation learning from RGB frames, from unsupervised optical flow, or from both. (iii) We thoroughly evaluate the quality of the learnt representations on four downstream tasks: action recognition, video retrieval, learning with scarce annotations, and unintentional action classification. In all cases, we demonstrate state-of-the-art or comparable performance relative to other approaches, using orders of magnitude less training data.
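The convex-combination idea in contribution (i) can be illustrated with a minimal sketch. It assumes the attention weights are a softmax over one score per memory slot; the function names, the toy memory bank, and the hard-coded scores are hypothetical stand-ins (in a trained model the scores would come from a learned network conditioned on past frames), and this is not the authors' implementation.

```python
import math

def softmax(scores):
    """Numerically stable softmax: non-negative weights that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def predict_from_memory(scores, memory):
    """Return attention weights and the weighted sum of memory slots.

    Because the weights are non-negative and sum to 1, the prediction
    is a convex combination of the memory slots.
    """
    weights = softmax(scores)
    dim = len(memory[0])
    predicted = [
        sum(w * slot[d] for w, slot in zip(weights, memory))
        for d in range(dim)
    ]
    return weights, predicted

# Toy memory bank: 3 slots, each a 2-dimensional vector (illustrative only).
memory = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Hypothetical attention scores, one per memory slot.
scores = [2.0, 0.5, -1.0]

weights, predicted = predict_from_memory(scores, memory)
```

Since every prediction lies inside the convex hull of the memory slots, different score vectors select different blends of the same stored representations, which is one way to read the abstract's remark about making multiple hypotheses efficiently.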
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 | Accuracy | 86.1 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 69.2 | 357 |
| Action Recognition | UCF101 (test) | Accuracy | 54.1 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.412 | 249 |
| Action Recognition | HMDB51 | Top-1 Acc | 54.5 | 225 |
| Action Recognition | HMDB51 | 3-Fold Accuracy | 41.2 | 191 |
| Video Action Recognition | UCF101 | Top-1 Acc | 84.3 | 153 |
| Action Recognition | UCF-101 | Top-1 Acc | 86.1 | 147 |
| Video Retrieval | UCF101 (1) | Top-1 Acc | 40.2 | 92 |
| Video Recognition | HMDB51 | Accuracy | 54.5 | 89 |