
Learning Video Representations using Contrastive Bidirectional Transformer

About

This paper proposes a self-supervised learning approach for video features that results in significantly improved performance on downstream tasks (such as video classification, captioning and segmentation) compared to existing methods. Our method extends the BERT model for text sequences to the case of sequences of real-valued feature vectors, by replacing the softmax loss with noise contrastive estimation (NCE). We also show how to learn representations from sequences of visual features and sequences of words derived from ASR (automatic speech recognition), and show that such cross-modal training (when possible) helps even more.
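The core modeling change described above — swapping BERT's softmax loss for noise contrastive estimation so the model can operate on real-valued feature vectors instead of discrete tokens — can be sketched as an InfoNCE-style objective. The snippet below is a minimal illustration, not the paper's implementation: the function name `nce_loss` and the use of in-batch negatives with cosine similarities are assumptions for the sake of the example.

```python
import numpy as np

def nce_loss(predictions, targets):
    """Contrastive NCE-style loss over real-valued feature vectors.

    Each row of `predictions` should score highest against the
    matching row of `targets`; all other rows in the batch act as
    noise (negative) samples.
    """
    # L2-normalize so the dot product is a cosine similarity.
    p = predictions / np.linalg.norm(predictions, axis=1, keepdims=True)
    t = targets / np.linalg.norm(targets, axis=1, keepdims=True)
    logits = p @ t.T  # similarity of every prediction to every target

    # Numerically stable log-softmax over each row; the positive
    # pair sits on the diagonal.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))
```

When predictions match their targets, the diagonal similarities dominate and the loss is small; random targets drive it toward the log of the batch size, which is why larger batches supply more informative noise samples.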

Chen Sun, Fabien Baradel, Kevin Murphy, Cordelia Schmid · 2019

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | UCF101 | – | – | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 79.5 | 357 |
| Action Recognition | UCF101 (test) | Accuracy | 54 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.446 | 249 |
| Action Recognition | HMDB51 | Top-1 Acc | 44.6 | 225 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 44.6 | 204 |
| Video Action Recognition | UCF101 | Top-1 Acc | 79.5 | 153 |
| Action Recognition | UCF-101 | Top-1 Acc | 79.5 | 147 |
| Action Recognition | UCF101 (Split 1) | Top-1 Acc | 79.5 | 105 |
| Video Captioning | YouCook2 | METEOR | 12.97 | 104 |

Showing 10 of 23 rows
