
Masked Feature Prediction for Self-Supervised Visual Pre-Training

About

We present Masked Feature Prediction (MaskFeat) for self-supervised pre-training of video models. Our approach first randomly masks out a portion of the input sequence and then predicts the feature of the masked regions. We study five different types of features and find that Histograms of Oriented Gradients (HOG), a hand-crafted feature descriptor, works particularly well in terms of both performance and efficiency. We observe that the local contrast normalization in HOG is essential for good results, which is in line with earlier work using HOG for visual recognition. Our approach can learn abundant visual knowledge and drive large-scale Transformer-based models. Without using extra model weights or supervision, MaskFeat pre-trained on unlabeled videos achieves unprecedented results of 86.7% with MViT-L on Kinetics-400, 88.3% on Kinetics-600, 80.4% on Kinetics-700, 39.8 mAP on AVA, and 75.0% on SSv2. MaskFeat further generalizes to image input, which can be interpreted as a video with a single frame, and obtains competitive results on ImageNet.
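
The mask-then-predict objective described in the abstract can be illustrated with a small NumPy sketch for a single grayscale image. This is not the authors' implementation: the function names (`hog_cells`, `random_mask`, `masked_feature_loss`), the 8-pixel cell size, the 9 orientation bins, the 0.4 mask ratio, and the per-cell L2 normalization (used here as a simple stand-in for HOG's local contrast normalization) are all assumptions chosen for illustration. A real setup would replace the all-zero placeholder prediction with the output of a Transformer that sees only the unmasked input.

```python
import numpy as np


def hog_cells(img, cell=8, bins=9):
    """Simplified HOG targets: one orientation histogram per cell.

    Assumption: per-cell L2 normalization stands in for the local
    contrast normalization mentioned in the abstract.
    """
    gy, gx = np.gradient(img.astype(np.float64))   # gradients along rows, cols
    mag = np.hypot(gx, gy)                         # gradient magnitude
    ang = np.mod(np.arctan2(gy, gx), np.pi)        # unsigned orientation in [0, pi)
    bin_idx = np.minimum((ang / np.pi * bins).astype(int), bins - 1)

    rows, cols = img.shape[0] // cell, img.shape[1] // cell
    feats = np.zeros((rows, cols, bins))
    for r in range(rows):
        for c in range(cols):
            sl = (slice(r * cell, (r + 1) * cell),
                  slice(c * cell, (c + 1) * cell))
            # Magnitude-weighted vote of each pixel into its orientation bin.
            feats[r, c] = np.bincount(bin_idx[sl].ravel(),
                                      weights=mag[sl].ravel(),
                                      minlength=bins)
    norm = np.linalg.norm(feats, axis=-1, keepdims=True)
    return feats / (norm + 1e-6)


def random_mask(rows, cols, ratio, rng):
    """Boolean grid with round(ratio * rows * cols) cells set to True (masked)."""
    n = rows * cols
    k = int(round(ratio * n))
    flat = np.zeros(n, dtype=bool)
    flat[rng.choice(n, size=k, replace=False)] = True
    return flat.reshape(rows, cols)


def masked_feature_loss(pred, target, mask):
    """Mean squared error computed over masked cells only."""
    return ((pred - target) ** 2)[mask].mean()


# Usage: a random image stands in for real data, and an all-zero array
# stands in for the model's prediction.
rng = np.random.default_rng(0)
image = rng.random((32, 32))
target = hog_cells(image)                    # shape (4, 4, 9)
mask = random_mask(*target.shape[:2], ratio=0.4, rng=rng)
placeholder_pred = np.zeros_like(target)
print(masked_feature_loss(placeholder_pred, target, mask))
```

The loss is taken only over masked cells, so the sketch mirrors the idea of predicting features for regions the model did not see; unmasked cells contribute nothing to the objective.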

Chen Wei, Haoqi Fan, Saining Xie, Chao-Yuan Wu, Alan Yuille, Christoph Feichtenhofer • 2021

Related benchmarks

Task                   Dataset                        Result               Rank
Semantic segmentation  ADE20K (val)                   mIoU 48.8            2731
Object Detection       COCO 2017 (val)                --                   2454
Image Classification   ImageNet-1K 1.0 (val)          Top-1 Accuracy 85.7  1866
Instance Segmentation  COCO 2017 (val)                --                   1144
Semantic segmentation  ADE20K                         mIoU 48.8            936
Image Classification   ImageNet-1K                    Top-1 Acc 85.7       836
Image Classification   ImageNet 1k (test)             Top-1 Accuracy 84    798
Action Recognition     Something-Something v2 (val)   Top-1 Accuracy 75    535
Image Classification   ImageNet-1K                    Top-1 Acc 84         524
Image Classification   ImageNet-1k (val)              Top-1 Accuracy 85.7  512
Showing 10 of 60 rows
