VideoMAE: Masked Autoencoders are Data-Efficient Learners for Self-Supervised Video Pre-Training

About

Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging extracting more effective video representations during this pre-training process. We obtain three important findings on SSVP: (1) An extremely high proportion of masking ratio (i.e., 90% to 95%) still yields favorable performance of VideoMAE. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue. Notably, our VideoMAE with the vanilla ViT can achieve 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101, and 62.6% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
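The tube masking idea described in the abstract can be sketched as follows: a single random spatial mask is drawn at a very high ratio and then repeated across every temporal slice, so a masked patch position stays masked for the whole clip. This is a minimal illustration under stated assumptions, not the authors' implementation. The function name `tube_mask`, its parameters, and the 8 x 14 x 14 token grid in the usage lines are all hypothetical choices made for the example.

```python
import numpy as np


def tube_mask(num_temporal, num_spatial, mask_ratio=0.9, rng=None):
    """Return a boolean mask of shape (num_temporal, num_spatial).

    True marks a masked token. The same spatial positions are masked in
    every temporal slice, which is what makes the mask a "tube".
    All names here are illustrative assumptions, not the paper's API.
    """
    rng = np.random.default_rng() if rng is None else rng

    # Number of spatial positions to hide in each temporal slice.
    num_masked = int(round(mask_ratio * num_spatial))

    # Draw one spatial mask without replacement.
    spatial = np.zeros(num_spatial, dtype=bool)
    masked_idx = rng.choice(num_spatial, size=num_masked, replace=False)
    spatial[masked_idx] = True

    # Repeat the identical spatial mask along the temporal axis.
    return np.tile(spatial, (num_temporal, 1))


# Usage: an assumed grid of 8 temporal slices with 14 x 14 spatial patches.
mask = tube_mask(num_temporal=8, num_spatial=14 * 14, mask_ratio=0.9,
                 rng=np.random.default_rng(0))
visible_tokens = int((~mask).sum())  # tokens an encoder would actually see
```

Because each masked position is hidden in every slice, a model cannot recover a masked patch by copying it from a neighboring frame, which is consistent with the abstract's point that temporal redundancy in video allows a higher masking ratio than in images.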

Zhan Tong, Yibing Song, Jue Wang, Limin Wang • 2022

Related benchmarks

Task                       Dataset                        Result               Rank
Video Object Segmentation  DAVIS 2017 (val)               J mean 54.9          1130
Image Classification       ImageNet-1K                    Top-1 Acc 81.1       836
Action Recognition         Something-Something v2 (val)   Top-1 Accuracy 75.4  535
Action Recognition         Kinetics-400                   Top-1 Acc 86.1       413
Action Recognition         UCF101                         Accuracy 96.1        365
Action Recognition         Something-Something v2         Top-1 Accuracy 75.4  341
Action Recognition         Something-Something v2 (test)  Top-1 Acc 75.4       333
Action Recognition         UCF101 (test)                  Accuracy 96.1        307
Action Recognition         HMDB51 (test)                  Accuracy 0.733       249
Action Recognition         Kinetics 400 (test)            Top-1 Accuracy 86.1  245