Multimodal Clustering Networks for Self-supervised Learning from Unlabeled Videos

About

Multimodal self-supervised learning is attracting growing attention because it makes it possible not only to train large networks without human supervision but also to search and retrieve data across modalities. In this context, this paper proposes a self-supervised training framework that learns a common multimodal embedding space which, in addition to sharing representations across modalities, enforces a grouping of semantically similar instances. To this end, we extend instance-level contrastive learning with a multimodal clustering step in the training pipeline to capture semantic similarities across modalities. The resulting embedding space enables retrieval of samples across all modalities, even from unseen datasets and different domains. We train our model on the HowTo100M dataset and evaluate its zero-shot capabilities on two challenging tasks, text-to-video retrieval and temporal action localization, showing state-of-the-art results on four different datasets.
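
The combination of instance-level contrastive learning with a clustering step described above can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the temperature, the number of clusters, the averaging used to fuse modalities, and the squared-distance clustering term are all assumptions chosen to keep the example short.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Scale each row to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)

def contrastive_loss(a, b, temperature=0.1):
    # Symmetric InfoNCE-style loss: row i of `a` and row i of `b` form the
    # positive pair; all other rows in the batch act as negatives.
    logits = l2_normalize(a) @ l2_normalize(b).T / temperature

    def cross_entropy_diag(z):
        z = z - z.max(axis=1, keepdims=True)  # numerical stability
        log_prob = z - np.log(np.exp(z).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

def kmeans(x, k, iters=10, seed=0):
    # Plain Lloyd's algorithm; returns centroids and per-row cluster indices.
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for j in range(k):
            members = x[assign == j]
            if len(members) > 0:
                centroids[j] = members.mean(axis=0)
    dists = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return centroids, dists.argmin(axis=1)

def clustering_loss(video, audio, text, k=4):
    # Assumption: fuse modalities by averaging, cluster the fused features,
    # then pull each modality toward the centroid of its instance's cluster.
    v, a, t = (l2_normalize(m) for m in (video, audio, text))
    fused = (v + a + t) / 3.0
    centroids, assign = kmeans(fused, k)
    targets = centroids[assign]
    return float(np.mean([((m - targets) ** 2).sum(axis=1).mean() for m in (v, a, t)]))

def total_loss(video, audio, text, k=4, cluster_weight=1.0):
    # Pairwise contrastive terms across the three modalities plus the
    # (hypothetically weighted) clustering term.
    pairwise = (contrastive_loss(video, text)
                + contrastive_loss(video, audio)
                + contrastive_loss(audio, text))
    return float(pairwise + cluster_weight * clustering_loss(video, audio, text, k))
```

In a real training setup the embeddings would come from learned video, audio, and text encoders and the loss would be minimized by gradient descent; here the inputs are plain arrays of shape (batch, dim) so the structure of the objective is easy to inspect.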

Brian Chen, Andrew Rouditchenko, Kevin Duarte, Hilde Kuehne, Samuel Thomas, Angie Boggust, Rameswar Panda, Brian Kingsbury, Rogerio Feris, David Harwath, James Glass, Michael Picheny, Shih-Fu Chang • 2021

Related benchmarks

Task	Dataset	Result
Text-to-Video Retrieval	MSR-VTT (test)	R@1 10.5	271
Text-to-Video Retrieval	MSR-VTT (1k-A)	R@10 33.8	211
Text-to-Video Retrieval	MSRVTT	R@1 10.5	144
Text-to-Video Retrieval	YouCook2	Recall@10 81.4	117
Text-to-Video Retrieval	Youcook2 (test)	Recall@10 45.2	59
Action Step Localization	CrossTask (test)	Recall 35.1	32
Action Step Localization	CrossTask	Average Recall 35.1	28
Text-to-Video Retrieval	MSR-VTT 1K videos (test)	Recall@10 33.8	25
Video-paragraph retrieval	YouCookII Background Removed (test)	R@1 53.4	12
Video Retrieval (clip-caption)	YouCookII (evaluation)	Recall@1 18.1	11
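
The recall metrics in the table (R@K, Recall@K) are conventionally the percentage of text queries whose ground-truth video appears among the top K retrieved results. A minimal sketch of that computation follows, assuming a square similarity matrix in which query i is paired with video i; the function name and the toy scores are illustrative and not taken from the paper.

```python
import numpy as np

def recall_at_k(similarity, k):
    # similarity[i, j] is the score of text query i against video j;
    # the ground-truth video for query i is assumed to be video i.
    similarity = np.asarray(similarity, dtype=float)
    correct = np.diag(similarity)[:, None]
    # Rank of the correct video = number of videos scored strictly higher.
    ranks = (similarity > correct).sum(axis=1)
    return float((ranks < k).mean())

# Toy example with three queries and three videos.
sim = [[0.9, 0.1, 0.2],
       [0.3, 0.2, 0.8],
       [0.1, 0.7, 0.6]]
r1 = recall_at_k(sim, 1)   # only query 0 ranks its video first
r2 = recall_at_k(sim, 2)   # queries 0 and 2 have their video in the top 2
```

Benchmarks usually report this value multiplied by 100.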
