
Harvest Video Foundation Models via Efficient Post-Pretraining

About

Building video-language foundation models is costly and difficult due to the redundant nature of video data and the lack of high-quality video-language datasets. In this paper, we propose an efficient framework to harvest video foundation models from image ones. Our method is intuitively simple: it randomly drops input video patches and masks out input text during post-pretraining. Patch dropping significantly boosts training efficiency, and text masking enforces the learning of cross-modal fusion. We conduct extensive experiments to validate the effectiveness of our method on a wide range of video-language downstream tasks, including various zero-shot tasks, video question answering, and video-text retrieval. Despite its simplicity, our method achieves state-of-the-art performance, comparable to that of some heavily pretrained video foundation models. Our method is extremely efficient: it can be trained in less than one day on 8 GPUs and requires only WebVid-10M as pretraining data. We hope our method can serve as a simple yet strong counterpart for prevalent video foundation models, provide useful insights when building them, and make large pretrained models more accessible and sustainable. This is part of the InternVideo project: https://github.com/OpenGVLab/InternVideo.
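The two operations named in the abstract, random video patch dropping and text masking, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the keep ratio of 0.3, the mask ratio of 0.5, and the `[MASK]` token are all assumptions chosen for illustration, and patches are stood in for by integer ids rather than real embeddings.

```python
import random

def drop_video_patches(patches, keep_ratio, rng):
    """Keep a random subset of patch tokens, preserving their original order.

    keep_ratio is a hypothetical hyperparameter, not a value from the paper.
    """
    num_keep = max(1, int(len(patches) * keep_ratio))
    kept_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return [patches[i] for i in kept_idx], kept_idx

def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Replace a random subset of text tokens with a mask token.

    mask_ratio and mask_token are illustrative assumptions.
    """
    num_mask = max(1, int(len(tokens) * mask_ratio))
    masked_idx = sorted(rng.sample(range(len(tokens)), num_mask))
    masked_set = set(masked_idx)
    masked = [mask_token if i in masked_set else tok
              for i, tok in enumerate(tokens)]
    return masked, masked_idx

rng = random.Random(0)

# 4 frames x 196 patches per frame, represented here by integer ids.
patches = list(range(4 * 196))
kept, kept_idx = drop_video_patches(patches, keep_ratio=0.3, rng=rng)

caption = "a dog runs across the snowy field".split()
masked_caption, masked_idx = mask_text_tokens(caption, mask_ratio=0.5, rng=rng)

print(len(patches), "->", len(kept), "patches fed to the video encoder")
print(masked_caption)
```

Because the encoder only sees the kept patches, its input sequence shrinks in proportion to the keep ratio, which is where the efficiency gain described in the abstract would come from. The masked caption positions would then be predicted from the video and the remaining text.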

Yizhuo Li, Kunchang Li, Yinan He, Yi Wang, Yali Wang, Limin Wang, Yu Qiao, Ping Luo • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video-to-Text retrieval | MSR-VTT | Recall@1 | 47.4 | 185 |
| Video Question Answering | TGIF-QA | Accuracy | 69.3 | 156 |
| Video Question Answering | MSVD | Accuracy | 52.4 | 152 |
| Action Recognition | Kinetics-400 full (val) | Top-1 Acc | 56.8 | 141 |
| Video-to-Text retrieval | DiDeMo | R@1 | 32.2 | 130 |
| Video Question Answering | MSRVTT | Accuracy | 44.8 | 100 |
| Video-to-Text retrieval | LSMDC | R@1 | 24.7 | 64 |
| Video-Text Retrieval | MSVD | R@1 | 51 | 29 |
| Video-Text Retrieval | MSR-VTT | -- | -- | 22 |
| Multiple-Choice | MSR-VTT | Accuracy | 93.5 | 11 |
Showing 10 of 12 rows
