Prompting Visual-Language Models for Efficient Video Understanding
About
Image-based visual-language (I-VL) pre-training has shown great success in learning joint visual-textual representations from large-scale web data, revealing a remarkable ability for zero-shot generalisation. This paper presents a simple but strong baseline that efficiently adapts the pre-trained I-VL model and exploits its powerful capabilities for resource-hungry video understanding tasks, with minimal training. Specifically, we propose to optimise a few random vectors, termed continuous prompt vectors, that convert video-related tasks into the same format as the pre-training objectives. In addition, to bridge the gap between static images and videos, temporal information is encoded with lightweight Transformers stacked on top of frame-wise visual features. Experimentally, we conduct extensive ablation studies to analyse the critical components. On 10 public benchmarks of action recognition, action localisation, and text-video retrieval, across closed-set, few-shot, and zero-shot scenarios, we achieve performance competitive with or exceeding existing state-of-the-art methods, despite optimising significantly fewer parameters.
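The two ingredients above — learnable continuous prompt vectors around the frozen text embeddings, and a lightweight temporal Transformer over frozen frame-wise features — can be sketched as follows. This is a minimal illustration, not the authors' code: the class names, the shared-prompt pooling, and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PromptedVideoClassifier(nn.Module):
    """Sketch: only the prompt vectors, positional embeddings, and the
    temporal Transformer are trained; frame and class-name features are
    assumed to come from a frozen pre-trained I-VL model (e.g. CLIP)."""

    def __init__(self, dim=512, n_prompts=8, n_frames=8):
        super().__init__()
        # a few learnable "continuous prompt vectors" (shared across classes here)
        self.prompts = nn.Parameter(torch.randn(n_prompts, dim) * 0.02)
        # lightweight temporal Transformer stacked on frame-wise features
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=1)
        self.pos = nn.Parameter(torch.zeros(1, n_frames, dim))

    def encode_video(self, frame_feats):
        # frame_feats: (B, T, D) frozen frame-wise visual features
        x = self.temporal(frame_feats + self.pos)
        return x.mean(dim=1)  # (B, D) video-level embedding

    def forward(self, frame_feats, class_embeds):
        # class_embeds: (C, D) frozen text embeddings of the class names;
        # prepend the shared prompts to each class, then mean-pool (simplified)
        C = class_embeds.size(0)
        prompted = torch.cat(
            [self.prompts.unsqueeze(0).expand(C, -1, -1),
             class_embeds.unsqueeze(1)], dim=1).mean(dim=1)  # (C, D)
        v = F.normalize(self.encode_video(frame_feats), dim=-1)
        t = F.normalize(prompted, dim=-1)
        return v @ t.t()  # (B, C) cosine-similarity logits
```

With frozen features of dimension 512 and 400 Kinetics classes, a forward pass would look like `model(torch.randn(2, 8, 512), torch.randn(400, 512))`, yielding a `(2, 400)` similarity matrix; only the small prompt/Transformer parameters receive gradients, which is what keeps the adaptation cheap.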
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc | 76.9 | 413 |
| Action Recognition | Something-Something v2 (test) | Top-1 Acc | 9.7 | 333 |
| Action Recognition | UCF101 (test) | Accuracy | 93 | 307 |
| Temporal Action Localization | ActivityNet 1.3 (val) | AP@0.5 | 44 | 257 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.664 | 249 |
| Action Recognition | Kinetics 400 (test) | Top-1 Accuracy | 58.5 | 245 |
| Action Recognition | HMDB51 | Top-1 Acc | 44.3 | 225 |
| Action Recognition | Something-Something v2 (test val) | Top-1 Accuracy | 31 | 187 |
| Video Action Recognition | Kinetics-400 | Top-1 Acc | 76.8 | 184 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 93.6 | 155 |