Self-Supervised MultiModal Versatile Networks
About
Videos are a rich source of multi-modal supervision. In this work, we learn representations using self-supervision by leveraging three modalities naturally present in videos: visual, audio and language streams. To this end, we introduce the notion of a multimodal versatile network -- a network that can ingest multiple modalities and whose representations enable downstream tasks in multiple modalities. In particular, we explore how best to combine the modalities, such that fine-grained representations of the visual and audio modalities can be maintained, whilst also integrating text into a common embedding. Driven by versatility, we also introduce a novel process of deflation, so that the networks can be effortlessly applied to visual data in the form of video or a static image. We demonstrate how such networks trained on large collections of unlabelled video data can be applied to video, video-text, image and audio tasks. Equipped with these representations, we obtain state-of-the-art performance on multiple challenging benchmarks including UCF101, HMDB51, Kinetics600, AudioSet and ESC-50 when compared to previous self-supervised work. Our models are publicly available.
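To make the shared-embedding idea concrete, here is a minimal sketch, loosely following the paper's fine-and-coarse design: video and audio are projected into a fine-grained joint space, while text joins a coarser space reached via a further projection of the video embedding. The sketch is written in PyTorch (an assumption; it is not the released implementation), uses a plain symmetric InfoNCE loss where the paper pairs text with a MIL-NCE-style objective, and all module names, dimensions and stand-in features are illustrative only.

```python
import torch
import torch.nn.functional as F
from torch import nn

class VersatileProjection(nn.Module):
    """Toy projection heads sketching a fine-and-coarse embedding graph.
    Backbone encoders are stubbed out; dimensions are illustrative only."""

    def __init__(self, d_video=512, d_audio=512, d_text=300, d_fine=512, d_coarse=256):
        super().__init__()
        # Video and audio meet in a shared fine-grained space.
        self.video_to_fine = nn.Linear(d_video, d_fine)
        self.audio_to_fine = nn.Linear(d_audio, d_fine)
        # Text meets video in a coarser space, reached from the fine
        # video embedding via one extra projection.
        self.fine_to_coarse = nn.Linear(d_fine, d_coarse)
        self.text_to_coarse = nn.Linear(d_text, d_coarse)

    def forward(self, video_feat, audio_feat, text_feat):
        v_proj = self.video_to_fine(video_feat)
        v_fine = F.normalize(v_proj, dim=-1)
        a_fine = F.normalize(self.audio_to_fine(audio_feat), dim=-1)
        v_coarse = F.normalize(self.fine_to_coarse(v_proj), dim=-1)
        t_coarse = F.normalize(self.text_to_coarse(text_feat), dim=-1)
        return v_fine, a_fine, v_coarse, t_coarse

def nce_loss(x, y, temperature=0.07):
    """Symmetric InfoNCE over a batch: matching (x_i, y_i) pairs are
    positives, every other pairing in the batch is a negative."""
    logits = x @ y.t() / temperature
    targets = torch.arange(x.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

# Example: one training step on random stand-in features.
heads = VersatileProjection()
v, a, t = torch.randn(8, 512), torch.randn(8, 512), torch.randn(8, 300)
v_fine, a_fine, v_coarse, t_coarse = heads(v, a, t)
loss = nce_loss(v_fine, a_fine) + nce_loss(v_coarse, t_coarse)
loss.backward()
```

The asymmetry is the point of the design: audio never has to pass through the lossy coarse space where text lives, so the fine-grained audio-visual agreement is preserved while text still shares an embedding with video.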
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 32.5 | 2731 |
| Semantic segmentation | ADE20K | mIoU | 32.5 | 936 |
| Object Detection | COCO (val) | mAP | 62.97 | 613 |
| Image Classification | ImageNet | Top-1 Accuracy | 57.4 | 429 |
| Action Recognition | UCF101 | Accuracy | 92.5 | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 95.2 | 357 |
| Audio Classification | ESC-50 | Accuracy | 88.9 | 325 |
| Action Recognition | UCF101 (test) | Accuracy | 91.5 | 307 |
| Text-to-Video Retrieval | MSR-VTT (test) | R@1 | 9.3 | 234 |
| Action Recognition | HMDB51 | Top-1 Accuracy | 75 | 225 |
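The image results above rely on the deflation process described in the About section, which turns a video network into one that accepts a single static image. A minimal sketch of the naive version, assuming PyTorch-style 3D convolutions: summing a 3D kernel over its temporal axis is equivalent to applying it to a temporally constant clip. The paper goes beyond this because temporal padding and pooling make the naive sum inexact, so this illustrates the idea rather than the paper's exact procedure.

```python
import torch
from torch import nn

def deflate_conv3d(conv3d: nn.Conv3d) -> nn.Conv2d:
    """Naively deflate a 3D convolution into a 2D one by summing the
    kernel over its temporal axis (a static image is a video that is
    constant over time)."""
    conv2d = nn.Conv2d(
        in_channels=conv3d.in_channels,
        out_channels=conv3d.out_channels,
        kernel_size=conv3d.kernel_size[1:],  # drop the temporal extent
        stride=conv3d.stride[1:],
        padding=conv3d.padding[1:],
        bias=conv3d.bias is not None,
    )
    with torch.no_grad():
        # Summing over time == applying the 3D kernel to identical frames.
        conv2d.weight.copy_(conv3d.weight.sum(dim=2))
        if conv3d.bias is not None:
            conv2d.bias.copy_(conv3d.bias)
    return conv2d

# Sanity check: on a clip of identical frames with no temporal padding,
# the deflated 2D conv reproduces the 3D conv output exactly.
c3 = nn.Conv3d(3, 8, kernel_size=(3, 3, 3), padding=(0, 1, 1))
c2 = deflate_conv3d(c3)
img = torch.randn(1, 3, 32, 32)
clip = img.unsqueeze(2).repeat(1, 1, 3, 1, 1)  # 3 identical frames
out3 = c3(clip)[:, :, 0]  # temporal dim collapses to size 1
out2 = c2(img)
print(torch.allclose(out3, out2, atol=1e-5))  # True
```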