Self-Supervised MultiModal Versatile Networks

About

Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to the visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied on video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
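The abstract does not spell out how deflation works, so the following is only a toy sketch under one assumed reading: an image is treated as a static clip (the same frame repeated in time), and a linear spatio-temporal filter is collapsed into a spatial one by summing its weights over the temporal axis. Under that assumption the two paths give the same output. The function names (`conv_video`, `deflate_kernel`, `conv_image`) and the 1D "frames" are hypothetical illustrations, not the authors' code.

```python
from math import isclose


def conv_video(video, kernel):
    """Valid correlation of a (T, W) clip with a (T, K) kernel.

    The kernel spans the whole clip in time, so the result is one row.
    """
    T, K = len(kernel), len(kernel[0])
    assert len(video) == T, "toy example: kernel must cover every frame"
    width = len(video[0])
    return [
        sum(kernel[t][k] * video[t][x + k] for t in range(T) for k in range(K))
        for x in range(width - K + 1)
    ]


def deflate_kernel(kernel):
    """Assumed deflation: sum the filter weights over the temporal axis."""
    K = len(kernel[0])
    return [sum(row[k] for row in kernel) for k in range(K)]


def conv_image(image, kernel_2d):
    """Valid correlation of a single frame with the deflated kernel."""
    K = len(kernel_2d)
    return [
        sum(kernel_2d[k] * image[x + k] for k in range(K))
        for x in range(len(image) - K + 1)
    ]


# A 5-pixel "image" and a 3-frame, 3-tap spatio-temporal kernel (made-up values).
image = [0.0, 1.0, 2.0, 4.0, 8.0]
kernel = [
    [0.5, -1.0, 0.25],
    [1.0, 0.0, -0.5],
    [0.25, 0.5, 1.0],
]

# Path 1: repeat the image in time and run the video filter.
static_video = [image] * len(kernel)
video_out = conv_video(static_video, kernel)

# Path 2: deflate the filter once, then run it directly on the image.
image_out = conv_image(image, deflate_kernel(kernel))

assert all(isclose(a, b) for a, b in zip(video_out, image_out))
```

The equivalence shown here relies on the filter being linear and on the input being constant in time; padding, normalisation layers, and non-linear temporal operations in a real network would need separate handling.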

Jean-Baptiste Alayrac, Adrià Recasens, Rosalia Schneider, Relja Arandjelović, Jason Ramapuram, Jeffrey De Fauw, Lucas Smaira, Sander Dieleman, Andrew Zisserman • 2020

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU 32.5 | 2731 |
| Semantic segmentation | ADE20K | mIoU 32.5 | 936 |
| Object Detection | COCO (val) | mAP 62.97 | 613 |
| Image Classification | ImageNet | Top-1 Accuracy 57.4 | 429 |
| Action Recognition | UCF101 | Accuracy 92.5 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy 95.2 | 357 |
| Audio Classification | ESC-50 | Accuracy 88.9 | 325 |
| Action Recognition | UCF101 (test) | Accuracy 91.5 | 307 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 9.3 | 234 |
| Action Recognition | HMDB51 | Top-1 Acc 75 | 225 |
Showing 10 of 61 rows
