Video Understanding as Machine Translation

About

With the advent of large-scale multimodal video datasets, especially sequences with audio or transcribed speech, there has been a growing interest in self-supervised learning of video representations. Most prior work formulates the objective as a contrastive metric learning problem between the modalities. To enable effective learning, however, these strategies require a careful selection of positive and negative samples often combined with hand-designed curriculum policies. In this work we remove the need for negative sampling by taking a generative modeling approach that poses the objective as a translation problem between modalities. Such a formulation allows us to tackle a wide variety of downstream video understanding tasks by means of a single unified framework, without the need for large batches of negative samples common in contrastive metric learning. We experiment with the large-scale HowTo100M dataset for training, and report performance gains over the state-of-the-art on several downstream tasks including video classification (EPIC-Kitchens), question answering (TVQA), captioning (TVC, YouCook2, and MSR-VTT), and text-based clip retrieval (YouCook2 and MSR-VTT).
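As a rough illustration of how a translation-style objective sidesteps negative sampling, the PyTorch sketch below conditions a standard transformer text decoder on projected video features and trains it with per-token cross-entropy against the paired caption or transcript. This is a minimal sketch of the general idea, not the authors' architecture: the module name, feature dimension, vocabulary size, and layer counts are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class VideoToTextTranslator(nn.Module):
    """Translate video features into paired text with a standard decoder."""
    def __init__(self, vocab_size=30522, feat_dim=2048, d_model=512,
                 nhead=8, num_layers=6):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, d_model)   # project clip features
        self.token_emb = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, video_feats, text_tokens):
        # video_feats: (B, T_v, feat_dim) precomputed clip features (assumed)
        # text_tokens: (B, T_t) ids of the paired caption / ASR transcript
        memory = self.video_proj(video_feats)
        tgt = self.token_emb(text_tokens)
        T = text_tokens.size(1)
        causal = torch.triu(  # standard causal mask for left-to-right decoding
            torch.full((T, T), float("-inf"), device=tgt.device), diagonal=1)
        hidden = self.decoder(tgt, memory, tgt_mask=causal)
        return self.lm_head(hidden)

model = VideoToTextTranslator()
video = torch.randn(2, 16, 2048)            # 2 clips, 16 feature vectors each
tokens = torch.randint(0, 30522, (2, 12))   # paired text, 12 token ids
logits = model(video, tokens[:, :-1])       # teacher forcing: shift by one
loss = F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
# The loss touches only the true video-text pairing; unlike a contrastive
# objective, no negative pairs or large batches are required.
loss.backward()
```

The key point the abstract makes is visible in the last lines: the generative loss is an ordinary per-token cross-entropy over the paired text, so no positive/negative sample selection or curriculum is needed.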

Bruno Korbar, Fabio Petroni, Rohit Girdhar, Lorenzo Torresani • 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Text-to-Video Retrieval | MSR-VTT | Recall@1 | 14.7 | 313 |
| Text-to-Video Retrieval | MSR-VTT (1k-A) | R@10 | 52.8 | 211 |
| Video Captioning | MSR-VTT (test) | -- | -- | 121 |
| Text-to-Video Retrieval | YouCook2 | Recall@10 | 43.9 | 117 |
| Video Captioning | YouCook2 | METEOR | 13.4 | 104 |
| Video Captioning | YouCook II (val) | -- | -- | 98 |
| Text-to-Video Retrieval | MSR-VTT 7K | Recall@10 | 52.8 | 27 |
| Text-to-Video Retrieval | MSR-VTT 1K 1.0 (test) | R@1 | 14.7 | 23 |
