Large-scale weakly-supervised pre-training for video action recognition
About
Current fully-supervised video datasets consist of only a few hundred thousand videos and fewer than a thousand domain-specific labels. This hinders progress towards advanced video architectures. This paper presents an in-depth study of using large volumes of web videos for pre-training video models for the task of action recognition. Our primary empirical finding is that pre-training at a very large scale (over 65 million videos), despite relying on noisy social-media videos and hashtags, substantially improves the state of the art on three challenging public action recognition datasets. Further, we examine three questions in the construction of weakly-supervised video action datasets. First, given that actions involve interactions with objects, how should one construct a verb-object pre-training label space to benefit transfer learning the most? Second, frame-based models perform quite well on action recognition; is pre-training for good image features sufficient, or is pre-training for spatio-temporal features valuable for optimal transfer learning? Finally, actions are generally less well-localized in long videos than in short videos; since action labels are provided at the video level, how should one choose video clips for best performance, given some fixed budget of number or minutes of videos?
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Classification | Kinetics 400 (val) | Top-1 Acc | 81.3 | 204 |
| Video Classification | Something-something v1 (test) | Top-1 Accuracy | 51.6 | 115 |
| Action Recognition | Kinetics | Top-1 Acc | 82.8 | 83 |
| Action Recognition | EPIC-KITCHENS (val) | Verb Top-1 Acc | 58.4 | 36 |
| Action Recognition | EPIC-Kitchens v1 (test s2 (unseen)) | Actions Top-1 Acc | 25.6 | 32 |
| Action Recognition | EPIC-Kitchens s1 (seen) v1 (test) | Actions Top-1 Accuracy | 34.5 | 29 |
| Action Recognition | Something-Something (val) | Top-1 Accuracy | 51.6 | 18 |
| Egocentric Action Recognition | EPIC-KITCHENS S2 (test) | Top-1 Accuracy (Verb) | 55.24 | 16 |
| Egocentric Action Recognition | EPIC-Kitchens test (S1) | Top-1 Acc (Verb) | 64.14 | 16 |
| Action Recognition | EPIC-Kitchens v1 (val) | Verbs Top-1 Acc | 58.4 | 15 |