Large-scale weakly-supervised pre-training for video action recognition

About

Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos to pre-train video models for action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite relying on noisy social-media videos and hashtags, substantially improves the state of the art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos than in short videos; since action labels are provided at the video level, how should one choose video clips for best performance, given a fixed budget of number or minutes of videos?

Deepti Ghadiyaram, Matt Feiszli, Du Tran, Xueting Yan, Heng Wang, Dhruv Mahajan • 2019

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Classification | Kinetics 400 (val) | Top-1 Acc: 81.3 | 204 |
| Video Classification | Something-something v1 (test) | Top-1 Accuracy: 51.6 | 115 |
| Action Recognition | Kinetics | Top-1 Acc: 82.8 | 83 |
| Action Recognition | EPIC-KITCHENS (val) | Verb Top-1 Acc: 58.4 | 36 |
| Action Recognition | EPIC-Kitchens v1 (test s2 (unseen)) | Actions Top-1 Acc: 25.6 | 32 |
| Action Recognition | EPIC-Kitchens s1 (seen) v1 (test) | Actions Top-1 Accuracy: 34.5 | 29 |
| Action Recognition | Something-Something (val) | Top-1 Accuracy: 51.6 | 18 |
| Egocentric Action Recognition | EPIC-KITCHENS S2 (test) | Top-1 Accuracy (Verb): 55.24 | 16 |
| Egocentric Action Recognition | EPIC-Kitchens test (S1) | Top-1 Acc (Verb): 64.14 | 16 |
| Action Recognition | EPIC-Kitchens v1 (val) | Verbs Top-1 Acc: 58.4 | 15 |
Showing 10 of 12 rows
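
Every result listed above is a Top-1 accuracy, that is, the percentage of evaluation examples whose highest-scoring predicted class equals the ground-truth label. The following is a minimal sketch of that calculation, assuming per-video class scores are already available. The function name `top1_accuracy` and the toy scores and labels are hypothetical and are not taken from the paper or its code.

```python
from typing import Sequence


def top1_accuracy(scores: Sequence[Sequence[float]], labels: Sequence[int]) -> float:
    """Return the percentage of examples whose highest-scoring class equals the label."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    if not labels:
        raise ValueError("need at least one example")

    correct = 0
    for row, label in zip(scores, labels):
        # Index of the largest score; ties resolve to the lowest class index.
        predicted = max(range(len(row)), key=row.__getitem__)
        if predicted == label:
            correct += 1
    return 100.0 * correct / len(labels)


# Toy data (hypothetical): 4 videos, 3 classes.
toy_scores = [
    [0.1, 0.7, 0.2],  # predicted class 1
    [0.8, 0.1, 0.1],  # predicted class 0
    [0.3, 0.3, 0.4],  # predicted class 2
    [0.2, 0.5, 0.3],  # predicted class 1
]
toy_labels = [1, 0, 1, 1]
toy_result = top1_accuracy(toy_scores, toy_labels)
print(f"Top-1 accuracy: {toy_result:.1f}%")
```

In the toy data three of the four predictions match their labels. How per-video scores are produced (for example, by aggregating clip-level predictions) is left outside this sketch, since the listing above does not specify it.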
