Video Cloze Procedure for Self-Supervised Spatio-Temporal Learning

About

We propose a novel self-supervised method, referred to as Video Cloze Procedure (VCP), to learn rich spatio-temporal representations. VCP first generates "blanks" by withholding video clips and then creates "options" by applying spatio-temporal operations to the withheld clips. Finally, it fills the blanks with the "options" and learns representations by predicting the category of the operation applied to each clip. VCP can act as either a proxy task or a target task in self-supervised learning. As a proxy task, it converts rich self-supervised representations into video clip operations (options), which enhances the flexibility and reduces the complexity of representation learning. As a target task, it can assess learned representation models in a uniform and interpretable manner. With VCP, we train spatio-temporal representation models (3D-CNNs) and apply these models to action recognition and video retrieval tasks. Experiments on commonly used benchmarks show that the trained models outperform state-of-the-art self-supervised models by significant margins.
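The blank/option construction described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the three operations (`identity`, `rotate_180`, `reverse_time`), the function name `make_cloze_sample`, and the array layout `(T, H, W, C)` are all assumptions chosen for illustration, since the abstract only says "spatio-temporal operations" without listing them.

```python
import numpy as np

# Hypothetical operation set; the abstract does not enumerate the real one.
def identity(clip):
    # Leave the withheld clip unchanged.
    return clip

def rotate_180(clip):
    # Spatial operation: rotate every frame by 180 degrees in the (H, W) plane.
    return np.rot90(clip, k=2, axes=(1, 2))

def reverse_time(clip):
    # Temporal operation: play the frames of the clip in reverse order.
    return clip[::-1]

OPERATIONS = [identity, rotate_180, reverse_time]

def make_cloze_sample(video, clip_len, blank_idx, op_idx):
    """Build one training sample from a video of shape (T, H, W, C).

    Returns the clip sequence with the blank filled by an "option",
    plus the index of the operation as the classification label.
    """
    n_clips = video.shape[0] // clip_len
    clips = [video[i * clip_len:(i + 1) * clip_len] for i in range(n_clips)]
    withheld = clips[blank_idx]             # the "blank"
    option = OPERATIONS[op_idx](withheld)   # the "option"
    clips[blank_idx] = option               # fill the blank with the option
    return np.stack(clips), op_idx

# Example: a toy 12-frame video split into 3 clips of 4 frames each.
video = np.arange(12 * 4 * 6 * 3, dtype=np.float32).reshape(12, 4, 6, 3)
clips, label = make_cloze_sample(video, clip_len=4, blank_idx=1, op_idx=2)
```

In a setup like this, a model such as a 3D-CNN would receive `clips` as input and be trained to predict `label`, so the supervision signal comes entirely from the operation that was applied.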

Dezhao Luo, Chang Liu, Yu Zhou, Dongbao Yang, Can Ma, Qixiang Ye, Weiping Wang • 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 | Accuracy | 68.5 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 68.5 | 357 |
| Action Recognition | UCF101 (test) | Accuracy | 68.5 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.325 | 249 |
| Action Recognition | HMDB51 | Top-1 Acc | 32.2 | 225 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 32.5 | 204 |
| Action Recognition | HMDB51 | 3-Fold Accuracy | 32.5 | 191 |
| Video Action Recognition | UCF101 | Top-1 Acc | 66.3 | 153 |
| Action Recognition | UCF-101 | Top-1 Acc | 66.3 | 147 |
| Action Classification | HMDB51 (over all three splits) | Accuracy | 32.5 | 121 |
Showing 10 of 25 rows
