End-to-End Learning of Visual Representations from Uncurated Instructional Videos

About

Annotating videos is cumbersome, expensive and not scalable. Yet, many strong video models still rely on manually annotated data. With the recent introduction of the HowTo100M dataset, narrated videos now offer the possibility of learning video representations without manual supervision. In this work we propose a new learning approach, MIL-NCE, capable of addressing misalignments inherent to narrated videos. With this approach we are able to learn strong video representations from scratch, without the need for any manual annotation. We evaluate our representations on a wide range of four downstream tasks over eight datasets: action recognition (HMDB-51, UCF-101, Kinetics-700), text-to-video retrieval (YouCook2, MSR-VTT), action localization (YouTube-8M Segments, CrossTask) and action segmentation (COIN). Our method outperforms all published self-supervised approaches for these tasks as well as several fully supervised baselines.

Antoine Miech, Jean-Baptiste Alayrac, Lucas Smaira, Ivan Laptev, Josef Sivic, Andrew Zisserman• 2019

Related benchmarks

Task	Dataset	Result
Action Recognition	UCF101	Accuracy91.3	433
Text-to-Video Retrieval	MSR-VTT	Recall@19.9	406
Action Recognition	UCF101 (mean of 3 splits)	Accuracy91.3	357
Action Recognition	UCF101 (test)	--	357
Text-to-Video Retrieval	MSR-VTT (test)	R@1990	265
Action Recognition	HMDB51	Top-1 Acc61	225
Text-to-Video Retrieval	MSR-VTT (1k-A)	R@1032.4	211
Action Recognition	HMDB-51 (average of three splits)	Top-1 Acc61	204
Text-to-Video Retrieval	MSRVTT (test)	Recall@524	178
Video Action Recognition	UCF101	Top-1 Acc91.3	165

Showing 10 of 99 rows

...

Other info

Code

Follow for update

@wizwand_team Discord