Expanding Language-Image Pretrained Models for General Video Recognition

About

Contrastive language-image pretraining has shown great success in learning visual-textual joint representation from web-scale data, demonstrating remarkable "zero-shot" generalization ability for various image tasks. However, how to effectively expand such new language-image pretraining methods to video domains is still an open problem. In this work, we present a simple yet effective approach that adapts the pretrained language-image models to video recognition directly, instead of pretraining a new model from scratch. More concretely, to capture the long-range dependencies of frames along the temporal dimension, we propose a cross-frame attention mechanism that explicitly exchanges information across frames. Such a module is lightweight and can be plugged into pretrained language-image models seamlessly. Moreover, we propose a video-specific prompting scheme, which leverages video content information for generating discriminative textual prompts. Extensive experiments demonstrate that our approach is effective and can be generalized to different video recognition scenarios. In particular, under fully-supervised settings, our approach achieves a top-1 accuracy of 87.1% on Kinetics-400, while using 12 times fewer FLOPs compared with Swin-L and ViViT-H. In zero-shot experiments, our approach surpasses the current state-of-the-art methods by +7.6% and +14.9% in terms of top-1 accuracy under two popular protocols. In few-shot scenarios, our approach outperforms previous best methods by +32.1% and +23.1% when the labeled data is extremely limited. Code and models are available at https://aka.ms/X-CLIP

Bolin Ni, Houwen Peng, Minghao Chen, Songyang Zhang, Gaofeng Meng, Jianlong Fu, Shiming Xiang, Haibin Ling • 2022
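
The abstract describes cross-frame attention only at a high level, so the following is a minimal illustrative sketch rather than the authors' implementation. It assumes each frame has already been summarized into a single token vector, and it omits the learned query/key/value projections that a real attention layer would have. The function name `cross_frame_attention` and the toy inputs are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_frame_attention(frame_tokens):
    """Mix information across frames with scaled dot-product attention.

    frame_tokens: list of T vectors (one per frame), each of length D.
    Assumption: one summary token per frame; learned projections omitted.
    Returns the updated tokens and the T x T attention weights.
    """
    dim = len(frame_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    updated, all_weights = [], []
    for query in frame_tokens:
        # Similarity of this frame to every frame (including itself).
        scores = [scale * sum(q * k for q, k in zip(query, key))
                  for key in frame_tokens]
        weights = softmax(scores)
        # Weighted sum of all frame tokens: the cross-frame "message".
        context = [sum(w * tok[i] for w, tok in zip(weights, frame_tokens))
                   for i in range(dim)]
        # Residual connection keeps the original per-frame content.
        updated.append([q + c for q, c in zip(query, context)])
        all_weights.append(weights)
    return updated, all_weights


# Toy example: three frames with 2-dimensional tokens.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
mixed, weights = cross_frame_attention(tokens)
print(len(mixed), len(mixed[0]))
```

Each row of `weights` sums to 1, so every updated frame token is its original value plus a convex combination of all frame tokens; this is the sense in which information is exchanged along the temporal dimension.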

Related benchmarks

Task	Dataset	Result
Action Recognition	Kinetics-400	Top-1 Acc 84.7	498
Action Recognition	UCF101	--	433
Action Recognition	UCF101 (test)	Accuracy 96.3	357
Action Recognition	Something-Something v2 (test)	Top-1 Acc 10	333
Action Recognition	HMDB51 (test)	Accuracy 0.717	249
Action Recognition	Kinetics 400 (test)	Top-1 Accuracy 84.7	245
Action Recognition	HMDB51	Top-1 Acc 44.6	225
Action Recognition	UCF-101	Top-1 Acc 72	225
Video Action Recognition	Kinetics-400	Top-1 Acc 87.7	197
Video Classification	Something-Something v2 (test)	Top-1 Acc 0.578	169

Showing 10 of 136 rows
