InternVideo: General Video Foundation Models via Generative and Discriminative Learning
About
Foundation models have recently shown excellent performance on a variety of downstream tasks in computer vision. However, most existing vision foundation models focus only on image-level pretraining and adaptation, which limits them on dynamic and complex video-level understanding tasks. To fill this gap, we present general video foundation models, InternVideo, which take advantage of both generative and discriminative self-supervised video learning. Specifically, InternVideo efficiently explores masked video modeling and video-language contrastive learning as the pretraining objectives, and selectively coordinates the video representations of these two complementary frameworks in a learnable manner to boost various video applications. Without bells and whistles, InternVideo achieves state-of-the-art performance on 39 video datasets spanning a wide range of tasks, including video action recognition/detection, video-language alignment, and open-world video applications. In particular, our method obtains 91.1% and 77.2% top-1 accuracy on the challenging Kinetics-400 and Something-Something V2 benchmarks, respectively. These results demonstrate the generality of InternVideo for video understanding. The code will be released at https://github.com/OpenGVLab/InternVideo .
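To make the discriminative objective concrete, here is a minimal sketch of a symmetric contrastive (InfoNCE-style) loss over paired video and text embeddings. It is a generic illustration, not InternVideo's implementation: the function names, the toy embeddings, and the temperature of 0.07 are all assumptions chosen for the example.

```python
import math

def _normalize(vec):
    # Scale to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def _diagonal_cross_entropy(logits):
    # Mean over rows of -log softmax(row)[i]; column i holds the matched pair.
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum = peak + math.log(sum(math.exp(s - peak) for s in row))
        total += log_sum - row[i]
    return total / len(logits)

def symmetric_contrastive_loss(video_emb, text_emb, temperature=0.07):
    # temperature=0.07 is an assumed value for illustration only.
    v = [_normalize(x) for x in video_emb]
    t = [_normalize(x) for x in text_emb]
    logits = [[sum(a * b for a, b in zip(vi, tj)) / temperature for tj in t]
              for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_diagonal_cross_entropy(logits)
                  + _diagonal_cross_entropy(transposed))

# Toy data: three "videos" and three captions in a shared 3-d embedding space.
video = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
matched_text = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.0], [0.0, 0.1, 0.9]]
shuffled_text = [matched_text[1], matched_text[2], matched_text[0]]

matched_loss = symmetric_contrastive_loss(video, matched_text)
shuffled_loss = symmetric_contrastive_loss(video, shuffled_text)
```

In this toy setup the correctly paired batch gives a much lower loss than the shuffled one, which is the signal such an objective uses to pull matching video and text representations together.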
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Something-Something v2 (val) | Top-1 Accuracy | 77.2 | 535 |
| Video Question Answering | MSRVTT-QA | Accuracy | 47.1 | 481 |
| Action Recognition | Kinetics-400 | Top-1 Acc | 91.1 | 413 |
| Text-to-Video Retrieval | DiDeMo (test) | R@1 | 57.9 | 376 |
| Video Question Answering | MSRVTT-QA (test) | Accuracy | 47.1 | 371 |
| Action Recognition | UCF101 | -- | -- | 365 |
| Text-to-Video Retrieval | DiDeMo | R@1 | 0.579 | 360 |
| Action Recognition | Something-Something v2 | Top-1 Accuracy | 77.2 | 341 |
| Video Question Answering | MSVD-QA | Accuracy | 55.5 | 340 |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 55.2 | 313 |
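
The retrieval rows report Recall@1 (R@1). As a rough sketch of how this metric is commonly computed, assuming a query-by-candidate similarity matrix in which the correct video for text query `i` sits at index `i` (the function name and the toy scores below are hypothetical):

```python
def recall_at_k(similarity, k=1):
    """Fraction of text queries whose matching video (same index) ranks in the top k."""
    hits = 0
    for query_index, scores in enumerate(similarity):
        # Candidate indices sorted from highest to lowest similarity.
        ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# Toy scores: rows are text queries, columns are candidate videos.
similarity = [
    [0.9, 0.2, 0.1],  # query 0: correct video ranked first
    [0.3, 0.4, 0.8],  # query 1: correct video ranked second
    [0.1, 0.2, 0.7],  # query 2: correct video ranked first
]
r1 = recall_at_k(similarity, k=1)  # 2 of 3 queries hit
r2 = recall_at_k(similarity, k=2)  # all 3 queries hit
```

The function returns a fraction in [0, 1]; leaderboards often multiply it by 100 and report a percentage instead.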