InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

About

This paper introduces InternVid, a large-scale video-centric multimodal dataset that enables learning powerful and transferable video-text representations for multimodal understanding and generation. The InternVid dataset contains over 7 million videos lasting nearly 760K hours, yielding 234M video clips accompanied by detailed descriptions totaling 4.1B words. Our core contribution is a scalable approach to autonomously building a high-quality video-text dataset with large language models (LLMs), which we use to demonstrate its efficacy for learning video-language representations at scale. Specifically, we utilize a multi-scale approach to generate video-related descriptions. Furthermore, we introduce ViCLIP, a video-text representation learning model based on ViT-L. Trained on InternVid via contrastive learning, this model demonstrates leading zero-shot action recognition and competitive video retrieval performance. Beyond basic video understanding tasks like recognition and retrieval, our dataset and model have broad applications. They are particularly beneficial for generating interleaved video-text data for learning a video-centric dialogue system and for advancing video-to-text and text-to-video generation research. These resources provide a tool for researchers and practitioners interested in multimodal video understanding and generation.

Yi Wang, Yinan He, Yizhuo Li, Kunchang Li, Jiashuo Yu, Xin Ma, Xinhao Li, Guo Chen, Xinyuan Chen, Yaohui Wang, Conghui He, Ping Luo, Ziwei Liu, Yali Wang, Limin Wang, Yu Qiao • 2023
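
The abstract says ViCLIP is learned on InternVid via contrastive learning but does not spell out the objective. As a rough illustration only, the sketch below shows a generic CLIP-style symmetric contrastive (InfoNCE) loss over a batch of paired video and text embeddings. The function names, the temperature of 0.07, and the toy embeddings are assumptions made for this example and are not taken from the paper.

```python
import math

def _diagonal_cross_entropy(logits):
    # Mean of -log softmax(row_i)[i]: the matching pair sits on the diagonal.
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def _normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (illustrative, not the paper's code).

    video_emb[i] and text_emb[i] are assumed to be a matching pair;
    temperature=0.07 is an assumed default.
    """
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    # Cosine similarity between every video and every text, scaled by temperature.
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-video direction
    return 0.5 * (_diagonal_cross_entropy(logits) + _diagonal_cross_entropy(logits_t))

# Toy check: aligned pairs should give a lower loss than shuffled pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_texts = [[1.0, 0.0], [0.0, 1.0]]
shuffled_texts = [[0.0, 1.0], [1.0, 0.0]]
loss_aligned = symmetric_contrastive_loss(videos, aligned_texts)
loss_shuffled = symmetric_contrastive_loss(videos, shuffled_texts)
```

The loss pulls each video embedding toward its own caption and away from the other captions in the batch, and does the same in the text-to-video direction.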

Related benchmarks

Task                         Dataset                 Result          Rank
Action Recognition           Kinetics-400            Top-1 Acc 64.8  413
Text-to-Video Retrieval      DiDeMo                  R@1 0.494       360
Text-to-Video Retrieval      MSR-VTT                 Recall@1 52.5   313
Text-to-Video Retrieval      MSVD                    R@1 49.1        218
Text-to-Video Retrieval      ActivityNet             R@1 15.1        197
Video Action Recognition     Kinetics-400            Top-1 Acc 82.4  184
Video-to-Text Retrieval      MSR-VTT                 Recall@1 51.8   157
Text-to-Video Retrieval      LSMDC                   R@1 33          154
Video Action Recognition     UCF101                  Top-1 Acc 95.2  153
Video Action Classification  Something-Something v2  Top-1 Acc 67.9  139
Showing 10 of 88 rows
...
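
The retrieval rows above report Recall@1 (R@1): the share of queries whose correct match is ranked first. A minimal sketch of that computation follows, assuming a square similarity matrix in which query i's ground-truth video is at index i. The function name `recall_at_k` and the toy scores are hypothetical and are not drawn from the paper or any benchmark.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose ground-truth item appears in the top k.

    similarity[i][j] is the score between text query i and video j;
    the correct video for query i is assumed to be video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Rank candidate videos from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)

# Toy similarity matrix: 3 text queries x 3 videos (made-up scores).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.2, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(sim, k=1)
r2 = recall_at_k(sim, k=2)
```

Video-to-text retrieval uses the same computation on the transposed matrix, so that each row is a video and each column is a text.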
