
Prompting Visual-Language Models for Efficient Video Understanding

About

Image-based visual-language (I-VL) pre-training has shown great success for learning joint visual-textual representations from large-scale web data, revealing remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline to efficiently adapt the pre-trained I-VL model, and exploit its powerful ability for resource-hungry video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve performance competitive with or superior to existing methods, despite optimising significantly fewer parameters.
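The recipe described above keeps the pre-trained I-VL backbone frozen and trains only two small components: learnable continuous prompt vectors that are concatenated with the class-name token embeddings on the text side, and a lightweight Transformer that models temporal structure over frame-wise visual features. A minimal PyTorch sketch of that idea follows; the dimensions, module names, and pooling choice are illustrative assumptions, not the authors' released code.

```python
import torch
import torch.nn as nn

class PromptedVideoClassifier(nn.Module):
    """Sketch: freeze the I-VL encoders (e.g. CLIP) and learn only
    (a) continuous prompt vectors prepended to class-name token embeddings,
    (b) a lightweight temporal Transformer over frozen frame features."""

    def __init__(self, embed_dim=512, n_prompts=16, n_layers=2):
        super().__init__()
        # (a) a few random vectors, optimised by gradient descent as prompts
        self.prompts = nn.Parameter(torch.randn(n_prompts, embed_dim) * 0.02)
        # (b) lightweight Transformer stacked on top of frame-wise features
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=n_layers)

    def video_embedding(self, frame_feats):
        # frame_feats: [B, T, D] frozen per-frame features from the image encoder
        ctx = self.temporal(frame_feats)  # temporal interactions across frames
        return ctx.mean(dim=1)            # pooled video representation [B, D]

    def text_tokens(self, class_embeds):
        # class_embeds: [C, L, D] token embeddings of the C class names;
        # prepend the shared learnable prompts to each class sequence
        C = class_embeds.shape[0]
        p = self.prompts.unsqueeze(0).expand(C, -1, -1)
        return torch.cat([p, class_embeds], dim=1)  # [C, n_prompts + L, D]

model = PromptedVideoClassifier()
video = model.video_embedding(torch.randn(4, 8, 512))  # 4 clips, 8 frames each
text = model.text_tokens(torch.randn(10, 4, 512))      # 10 classes, 4 name tokens
print(video.shape, text.shape)
```

Because only `self.prompts` and `self.temporal` carry gradients, the number of optimised parameters stays small relative to the frozen backbone, which is the efficiency argument the abstract makes.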

Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, Weidi Xie · 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | Kinetics-400 | Top-1 Acc | 76.9 | 413 |
| Action Recognition | Something-Something v2 (test) | Top-1 Acc | 9.7 | 333 |
| Action Recognition | UCF101 (test) | Accuracy | 93 | 307 |
| Temporal Action Localization | ActivityNet 1.3 (val) | AP@0.5 | 44 | 257 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.664 | 249 |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy | 58.5 | 245 |
| Action Recognition | HMDB51 | Top-1 Acc | 44.3 | 225 |
| Action Recognition | Something-Something v2 (test val) | Top-1 Accuracy | 31 | 187 |
| Video Action Recognition | Kinetics-400 | Top-1 Acc | 76.8 | 184 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 93.6 | 155 |

Showing 10 of 89 rows
