
Vita-CLIP: Video and text adaptive CLIP via Multimodal Prompting

About

Adopting contrastive image-text pretrained models like CLIP for video classification has gained attention due to its cost-effectiveness and competitive performance. However, recent works in this area face a trade-off. Finetuning the pretrained model to achieve strong supervised performance results in low zero-shot generalization. Similarly, freezing the backbone to retain zero-shot capability causes a significant drop in supervised accuracy. Because of this, recent works in the literature typically train separate models for supervised and zero-shot action recognition. In this work, we propose a multimodal prompt learning scheme that balances supervised and zero-shot performance under a single unified training. Our prompting approach on the vision side caters for three aspects: 1) global video-level prompts to model the data distribution; 2) local frame-level prompts to provide per-frame discriminative conditioning; and 3) a summary prompt to extract a condensed video representation. Additionally, we define a prompting scheme on the text side to augment the textual context. Through this prompting scheme, we achieve state-of-the-art zero-shot performance on Kinetics-600, HMDB51 and UCF101 while remaining competitive in the supervised setting. By keeping the pretrained backbone frozen, we optimize far fewer parameters and retain the existing general representation, which helps achieve the strong zero-shot performance. Our code and models are released at https://github.com/TalalWasim/Vita-CLIP.
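The three vision-side prompt types described in the abstract can be pictured as extra tokens placed in front of each frame's patch tokens. The following is a minimal shape-level sketch, not the authors' implementation: the function names, the token ordering, the random stand-in vectors, and in particular the mean-pooling rule used for the summary token are all assumptions made for illustration.

```python
import random


def make_tokens(count, dim, rng):
    """Return `count` random vectors of length `dim` (stand-ins for real tokens)."""
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


def mean_token(tokens):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(tokens)
    return [sum(column) / n for column in zip(*tokens)]


def build_prompted_frames(frames, global_prompts, local_prompts):
    """Assemble one token sequence per frame (hypothetical layout).

    frames:         T frames, each a list of patch tokens
    global_prompts: G video-level prompt tokens, shared by every frame
    local_prompts:  T frame-level prompt tokens, one per frame
    """
    # Summary token: here simply the mean over all patch tokens of all frames.
    # This pooling rule is an assumption for illustration only.
    summary = mean_token([token for frame in frames for token in frame])
    sequences = []
    for frame, local in zip(frames, local_prompts):
        # Assumed order: summary, global prompts, this frame's local prompt, patches.
        sequences.append([summary] + global_prompts + [local] + frame)
    return sequences


rng = random.Random(0)
T, N, D, G = 4, 6, 8, 3  # frames, patches per frame, embedding dim, global prompts
frames = [make_tokens(N, D, rng) for _ in range(T)]
global_prompts = make_tokens(G, D, rng)
local_prompts = make_tokens(T, D, rng)

sequences = build_prompted_frames(frames, global_prompts, local_prompts)
print(len(sequences), len(sequences[0]), len(sequences[0][0]))  # prints: 4 11 8
```

In this sketch each frame's sequence grows from 6 patch tokens to 11 tokens (1 summary + 3 global + 1 local + 6 patches). In a frozen-backbone setup, only the prompt tokens would be trainable, which is consistent with the abstract's point about optimizing a small number of parameters.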

Syed Talal Wasim, Muzammal Naseer, Salman Khan, Fahad Shahbaz Khan, Mubarak Shah • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | Something-Something v2 (val) | Top-1 Accuracy: 48.7 | 535 |
| Action Recognition | UCF101 (test) | -- | 307 |
| Action Recognition | HMDB51 (test) | -- | 249 |
| Action Recognition | Kinetics-600 (test) | Top-1 Accuracy: 67.4 | 84 |
| Video Action Recognition | HMDB51 (test) | Accuracy: 48.6 | 73 |
| Video Classification | Something-Something v2 | Top-1 Acc: 48.7 | 56 |
| Video Action Recognition | UCF101 (test) | Top-1 Acc: 75 | 46 |
| Zero-shot Action Recognition | UCF101 (test) | Accuracy: 75 | 33 |
| Action Recognition | HRI-30 | Overall Accuracy: 71 | 26 |
| Zero-shot Action Recognition | HMDB51 (test) | Accuracy: 48.6 | 25 |
Showing 10 of 37 rows
