Learning Spatiotemporal Features via Video and Text Pair Discrimination
About
Current video representations rely heavily on learning from manually annotated video datasets, which are time-consuming and expensive to acquire. We observe that videos are naturally accompanied by abundant text information, such as YouTube titles and Instagram captions. In this paper, we leverage this visual-textual connection to learn spatiotemporal features in an efficient weakly-supervised manner. We present a general cross-modal pair discrimination (CPD) framework to capture the correlation between a video and its associated text. Specifically, we adopt noise-contrastive estimation to tackle the computational issue imposed by the huge number of pair-instance classes, and design a practical curriculum learning strategy. We train our CPD models on both a standard video dataset (Kinetics-210k) and an uncurated web video dataset (Instagram-300k) to demonstrate the framework's effectiveness. Without further fine-tuning, the learnt models obtain competitive results for action classification on Kinetics under the linear classification protocol. Moreover, our visual model provides an effective initialization for fine-tuning on downstream tasks, yielding a remarkable performance gain for action recognition on UCF101 and HMDB51 compared with existing state-of-the-art self-supervised training methods. In addition, our CPD model sets a new state of the art for zero-shot action recognition on UCF101 by directly utilizing the learnt visual-textual embeddings. The code will be made available at https://github.com/MCG-NJU/CPD-Video.
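The repository code is not yet released; as a rough illustration of the pair-discrimination objective, the sketch below uses a simplified InfoNCE-style formulation in PyTorch, where each video is contrasted against the texts of the other videos in the batch. The function and variable names, the temperature value, and the in-batch-negatives simplification are assumptions; the paper's actual NCE formulation over explicit noise samples, and its curriculum schedule, may differ.

```python
import torch
import torch.nn.functional as F

def cpd_pair_discrimination_loss(video_emb, text_emb, temperature=0.07):
    """InfoNCE-style stand-in for cross-modal pair discrimination.

    Matching (video, text) pairs are treated as positives; all other
    pairs in the batch serve as negatives. video_emb and text_emb are
    (B, D) outputs of the visual and textual encoders. NOTE: this is a
    simplified sketch, not the paper's exact NCE over all pair classes.
    """
    v = F.normalize(video_emb, dim=1)   # unit-length video embeddings
    t = F.normalize(text_emb, dim=1)    # unit-length text embeddings
    logits = v @ t.T / temperature      # (B, B) similarity matrix
    targets = torch.arange(v.size(0), device=v.device)
    # symmetric cross-entropy: video-to-text and text-to-video matching
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In this view, the spatiotemporal visual network and the text encoder are trained jointly so that a video and its own title score higher than all mismatched pairs.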
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 | Accuracy | 92.8% | 365 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 39.9% | 155 |
| Action Recognition | HMDB51 | Accuracy | 63.8% | 78 |
| Action Recognition | UCF-101 (fine-tuning protocol) | Accuracy | 92.8% | 35 |
| Action Recognition | HMDB51 (fine-tuning) | Accuracy | 63.8% | 27 |
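For the zero-shot setting mentioned in the abstract, the learnt visual-textual embeddings can in principle be used directly by matching a video embedding against embeddings of the class names. A minimal sketch, assuming hypothetical `video_encoder` and `text_encoder` callables that map into the shared embedding space (the paper's actual interfaces may differ):

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(video_clip, class_names, video_encoder, text_encoder):
    """Predict an action label by nearest-neighbour search in the shared
    visual-textual embedding space, with no fine-tuning on the target set.

    video_encoder / text_encoder are hypothetical placeholders returning
    a (D,) video embedding and a (C, D) matrix of class-name embeddings.
    """
    v = F.normalize(video_encoder(video_clip), dim=-1)   # (D,)
    t = F.normalize(text_encoder(class_names), dim=-1)   # (C, D)
    scores = t @ v                                       # cosine score per class
    return class_names[scores.argmax().item()]
```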