VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners
About
We explore an efficient approach to establishing a foundational video-text model. We present VideoCoCa, which maximally reuses a pretrained image-text contrastive captioner (CoCa) model and adapts it to video-text tasks with minimal extra training. While previous works adapt image-text models with various cross-frame fusion modules, we find that the generative attentional pooling and contrastive attentional pooling layers in CoCa are instantly adaptable to flattened frame embeddings (see the sketch below), yielding state-of-the-art results on zero-shot video classification and zero-shot text-to-video retrieval. Furthermore, we explore lightweight finetuning on top of VideoCoCa and achieve strong results on video question answering and video captioning.
Shen Yan, Tao Zhu, Zirui Wang, Yuan Cao, Mi Zhang, Soham Ghosh, Yonghui Wu, Jiahui Yu • 2022
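The core idea is simple enough to sketch in a few lines: encode each frame with the pretrained image encoder, flatten the per-frame token embeddings across time into one long token sequence, and feed that sequence to CoCa's existing attentional poolers without any new fusion modules. Below is a minimal PyTorch sketch of this flow; the class and function names, tensor shapes, and the simplified single-layer pooler are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AttentionalPooler(nn.Module):
    """Simplified stand-in for CoCa's attentional pooling: a set of
    learned queries cross-attends to the input token sequence."""
    def __init__(self, dim: int, num_queries: int, num_heads: int = 8):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        q = self.queries.unsqueeze(0).expand(tokens.size(0), -1, -1)
        pooled, _ = self.attn(q, tokens, tokens)
        return pooled  # (batch, num_queries, dim)

def flatten_frame_embeddings(frame_tokens: torch.Tensor) -> torch.Tensor:
    """Flatten per-frame tokens from the image encoder into one sequence."""
    # frame_tokens: (batch, frames, tokens_per_frame, dim)
    b, t, n, d = frame_tokens.shape
    return frame_tokens.reshape(b, t * n, d)  # (batch, frames * tokens, dim)

# Assumed sizes for illustration: 8 frames, 196 patch tokens, 768-dim.
frame_tokens = torch.randn(2, 8, 196, 768)  # from a pretrained image encoder
flat = flatten_frame_embeddings(frame_tokens)

# Contrastive pooler: a single query yields one video embedding;
# generative pooler: many queries yield tokens for the text decoder.
contrastive_pooler = AttentionalPooler(dim=768, num_queries=1)
generative_pooler = AttentionalPooler(dim=768, num_queries=256)
video_embedding = contrastive_pooler(flat).squeeze(1)  # (batch, 768)
decoder_tokens = generative_pooler(flat)               # (batch, 256, 768)
```

Because the poolers attend over a token set rather than a fixed grid, extending their input from one frame's tokens to a flattened multi-frame sequence requires no architectural change, which is why the pretrained weights transfer zero-shot.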
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Question Answering | MSRVTT-QA | Accuracy | 46 | 481 |
| Action Recognition | Kinetics-400 | Top-1 Acc. | 72 | 413 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 46.3 | 371 |
| Action Recognition | UCF101 | -- | -- | 365 |
| Video Question Answering | MSVD-QA | Accuracy | 56.9 | 340 |
| Video Question Answering | ActivityNet-QA | Accuracy | 46 | 319 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 34.3 | 313 |
| Video Question Answering | ActivityNet-QA (test) | Accuracy | 56.1 | 275 |
| Video Question Answering | MSVD-QA (test) | Accuracy | 56.9 | 274 |
| Text-to-Video Retrieval | MSR-VTT (test) | Recall@1 | 34.3 | 234 |
*Showing 10 of 50 benchmark rows.*