
VicTR: Video-conditioned Text Representations for Activity Recognition

About

Vision-Language models (VLMs) have excelled in the image domain -- especially in zero-shot settings -- thanks to the availability of vast pretraining data (i.e., paired image-text samples). However, for videos, such paired data is not as abundant. Therefore, video-VLMs are usually designed by adapting pretrained image-VLMs to the video domain, instead of training from scratch. All such recipes rely on augmenting visual embeddings with temporal information (i.e., image $\rightarrow$ video), while text embeddings are often kept unchanged or even discarded. In this paper, we argue the contrary: better video-VLMs can be designed by focusing more on augmenting text, rather than visual, information. More specifically, we introduce Video-conditioned Text Representations (VicTR): a form of text embeddings optimized w.r.t. visual embeddings, creating a more flexible contrastive latent space. Our model can further make use of freely-available semantic information, in the form of visually-grounded auxiliary text (e.g., object or scene information). We evaluate our model on few-shot, zero-shot (HMDB-51, UCF-101), short-form (Kinetics-400) and long-form (Charades) activity recognition benchmarks, showing strong performance among video-VLMs.
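The core idea above -- conditioning class-text embeddings on the video's visual embeddings before computing contrastive similarity -- can be sketched as follows. This is a minimal, hypothetical simplification (one cross-attention step over frame embeddings, implemented in NumPy), not the paper's actual architecture; the function name and the residual-update form are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def video_conditioned_text(text_emb, frame_embs):
    """Condition one class-text embedding on a video's frame embeddings
    via a single cross-attention step (hypothetical sketch of VicTR's
    idea, not the published model)."""
    d = text_emb.shape[0]
    # attention weights: relevance of each frame to the text token
    attn = softmax(frame_embs @ text_emb / np.sqrt(d))
    # residual update: pull the text embedding toward attended frames
    cond = text_emb + attn @ frame_embs
    # project back onto the unit sphere, as in contrastive latent spaces
    return cond / np.linalg.norm(cond)

rng = np.random.default_rng(0)
d, n_frames = 16, 8
text = rng.normal(size=d)                      # a class-text embedding
frames = rng.normal(size=(n_frames, d))        # per-frame visual embeddings

cond_text = video_conditioned_text(text, frames)
print(cond_text.shape)  # (16,)
```

Classification would then score each class by the cosine similarity between its video-conditioned text embedding and the pooled video embedding, rather than between a fixed text embedding and the video.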

Kumara Kahatapitiya, Anurag Arnab, Arsha Nagrani, Michael S. Ryoo • 2023

Related benchmarks

Task                     | Dataset                              | Metric            | Result | Rank
Action Recognition       | UCF101 (Split 1)                     | Top-1 Acc         | 95.8   | 105
Action Recognition       | HMDB51                               | Accuracy (HMDB51) | 51     | 78
Action Recognition       | HMDB51 (split 1)                     | Top-1 Acc         | 70.7   | 75
Action Recognition       | Charades                             | mAP               | 0.576  | 64
Action Recognition       | Kinetics400 (val)                    | Accuracy          | 87     | 40
Activity Recognition     | HMDB-51 first split among three (test) | Top-1 Accuracy  | 51     | 10
Activity Recognition     | UCF-101 first split among three (test) | Top-1 Accuracy  | 72.4   | 10
Video Question Answering | NExT-QA zero-shot                    | Accuracy          | 0.455  | 7
