Prompting Visual-Language Models for Efficient Video Understanding
About
Image-based visual-language (I-VL) pre-training has shown great success in learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline that efficiently adapts the pre-trained I-VL model and exploits its powerful capabilities for resource-hungry video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve performance competitive with or exceeding existing state-of-the-art methods, despite optimising significantly fewer parameters.
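The two ingredients above — learnable continuous prompt vectors around the frozen text embeddings, and a lightweight temporal Transformer over frozen frame-wise features — can be sketched as follows. This is a minimal illustration, not the authors' code: the class names, the shared-prompt pooling, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedVideoClassifier(nn.Module):
    """Sketch: only the prompt vectors, positional embeddings, and the
    temporal Transformer are trained; frame and class-name features are
    assumed to come from a frozen pre-trained I-VL model (e.g. CLIP)."""

    def __init__(self, dim=512, n_prompts=8, n_frames=8):
        super().__init__()
        # a few learnable "continuous prompt vectors" (shared across classes here)
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # lightweight temporal Transformer stacked on frame-wise features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)
        self.pos = nn.Parameter(torch.zeros(1, n_frames, dim))

    def encode_video(self, frame_feats):
        # frame_feats: (B, T, D) frozen frame-wise visual features
        x = self.temporal(frame_feats + self.pos)
        return x.mean(dim=1)  # (B, D) video-level embedding

    def forward(self, frame_feats, class_embeds):
        # class_embeds: (C, D) frozen text embeddings of the class names;
        # prepend the shared prompts to each class, then mean-pool (simplified)
        C = class_embeds.size(0)
        prompted = torch.cat(
            [self.prompts.unsqueeze(0).expand(C, -1, -1),
             class_embeds.unsqueeze(1)], dim=1).mean(dim=1)  # (C, D)
        v = F.normalize(self.encode_video(frame_feats), dim=-1)
        t = F.normalize(prompted, dim=-1)
        return v @ t.t()  # (B, C) cosine-similarity logits
```

With frozen features of dimension 512 and 400 Kinetics classes, a forward pass would look like `model(torch.randn(2, 8, 512), torch.randn(400, 512))`, yielding a `(2, 400)` similarity matrix; only the small prompt/Transformer parameters receive gradients, which is what keeps the adaptation cheap.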
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc | 76.9 | 413 |
| Action Recognition | Something-Something v2 (test) | Top-1 Acc | 9.7 | 333 |
| Action Recognition | UCF101 (test) | Accuracy | 93 | 307 |
| Temporal Action Localization | ActivityNet 1.3 (val) | AP@0.5 | 44 | 257 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.664 | 249 |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy | 58.5 | 245 |
| Action Recognition | HMDB51 | Top-1 Acc | 44.3 | 225 |
| Action Recognition | Something-Something v2 (test val) | Top-1 Accuracy | 31 | 187 |
| Video Action Recognition | Kinetics-400 | Top-1 Acc | 76.8 | 184 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 93.6 | 155 |