
VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners

About

We explore an efficient approach to establish a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings, yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa, and achieve strong results on video question answering and video captioning.

Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu • 2022
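The key adaptation described in the abstract can be sketched numerically: each frame is encoded independently, the per-frame token embeddings are flattened along the time axis into one long sequence, and CoCa's attentional poolers consume that sequence as-is. The sketch below is a minimal, hypothetical illustration (single-head pooling, random weights, made-up sizes), not the actual CoCa implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: T frames, N tokens per frame, model width D.
T, N, D = 8, 16, 64

# Per-frame token embeddings from a (frozen) image encoder: (T, N, D).
frame_tokens = rng.standard_normal((T, N, D))

# VideoCoCa's key move: no cross-frame fusion module -- just flatten
# the frame axis so the poolers see one token sequence of shape (T*N, D).
flat_tokens = frame_tokens.reshape(T * N, D)

def attentional_pool(tokens, queries):
    """Single-head attentional pooling: learned queries attend over tokens."""
    scores = queries @ tokens.T / np.sqrt(tokens.shape[-1])    # (Q, T*N)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # row softmax
    return weights @ tokens                                    # (Q, D)

# Contrastive pooler: a single learned query yields one video embedding.
contrastive_query = rng.standard_normal((1, D))
video_embedding = attentional_pool(flat_tokens, contrastive_query)

# Generative pooler: multiple learned queries yield the token sequence
# that conditions the text decoder for captioning.
generative_queries = rng.standard_normal((256, D))
decoder_inputs = attentional_pool(flat_tokens, generative_queries)

print(video_embedding.shape)  # (1, 64)
print(decoder_inputs.shape)   # (256, 64)
```

Because the poolers are permutation-invariant attention over tokens, they accept a T-times-longer sequence without any architectural change, which is why the pretrained image-text weights transfer to video directly.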

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Question Answering | MSRVTT-QA | Accuracy: 46 | 481 |
| Action Recognition | Kinetics-400 | Top-1 Acc: 72 | 413 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy: 46.3 | 371 |
| Action Recognition | UCF101 | -- | 365 |
| Video Question Answering | MSVD-QA | Accuracy: 56.9 | 340 |
| Video Question Answering | ActivityNet-QA | Accuracy: 46 | 319 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1: 34.3 | 313 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy: 56.1 | 275 |
| Video Question Answering | MSVD-QA (test) | Accuracy: 56.9 | 274 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1: 34.3 | 234 |

Showing 10 of 50 rows.
