ActionCLIP: A New Paradigm for Video Action Recognition
About
The canonical approach to video action recognition treats the problem as a classic 1-of-N majority vote task: a neural model is trained to predict a fixed set of predefined categories, which limits its ability to transfer to new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them to numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or additional parameters. Moreover, to handle the scarcity of label texts and make use of abundant web data, we propose a new paradigm for action recognition based on this multimodal learning framework, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations by pre-training on a large amount of web image-text or video-text data. It then makes the action recognition task behave more like the pre-training problem via prompt engineering. Finally, it fine-tunes end-to-end on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches top performance on the general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git
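The video-text matching formulation described above can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt template, the mean pooling over frame embeddings, the temperature value, and all function names are assumptions chosen for illustration, and the embeddings are toy vectors standing in for the outputs of pre-trained video and text encoders.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def mean_pool(frame_embeddings):
    """Average per-frame embeddings into one video embedding (an assumed, simple fusion)."""
    count = len(frame_embeddings)
    return [sum(column) / count for column in zip(*frame_embeddings)]


def build_prompts(labels, template="a video of a person {}"):
    """Turn label names into sentences; the template is a hypothetical example."""
    return [template.format(label) for label in labels]


def match_probabilities(frame_embeddings, text_embeddings, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities between video and label texts."""
    video = l2_normalize(mean_pool(frame_embeddings))
    similarities = [
        sum(a * b for a, b in zip(video, l2_normalize(text)))
        for text in text_embeddings
    ]
    logits = [s / temperature for s in similarities]
    max_logit = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(logit - max_logit) for logit in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy data: in practice these vectors would come from pre-trained encoders.
labels = ["playing guitar", "riding a bike", "juggling"]
prompts = build_prompts(labels)
text_embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
frame_embeddings = [[0.1, 0.9, 0.0], [0.2, 0.8, 0.1]]

probabilities = match_probabilities(frame_embeddings, text_embeddings)
predicted_index = max(range(len(labels)), key=lambda i: probabilities[i])
predicted_label = labels[predicted_index]
print(prompts[predicted_index], "->", predicted_label)
```

Because the class set enters only through the text embeddings, swapping in a new list of labels changes the candidate classes without adding a classifier head, which is what makes zero-shot recognition possible under this formulation.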
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc | 83.8 | 413 |
| Action Recognition | Something-Something v2 (test) | Top-1 Acc | 11.1 | 333 |
| Action Recognition | UCF101 (test) | Accuracy | 91.8 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.661 | 249 |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy | 83.8 | 245 |
| Action Recognition | HMDB51 | Top-1 Acc | 40.8 | 225 |
| Video Classification | Kinetics 400 (val) | Top-1 Acc | 56.4 | 204 |
| Video Action Recognition | Kinetics-400 | Top-1 Acc | 83.8 | 184 |
| Video Action Recognition | UCF101 | Top-1 Acc | 91.4 | 153 |
| Action Recognition | UCF-101 | Top-1 Acc | 58.3 | 147 |