Self-supervised video pretraining yields robust and more human-aligned visual representations

About

Humans learn powerful representations of objects and scenes by observing how they evolve over time. Yet, outside of specific tasks that require explicit temporal understanding, static image pretraining remains the dominant paradigm for learning visual foundation models. We question this mismatch, and ask whether video pretraining can yield visual representations that bear the hallmarks of human perception: generalisation across tasks, robustness to perturbations, and consistency with human judgements. To that end we propose a novel procedure for curating videos, and develop a contrastive framework which learns from the complex transformations therein. This simple paradigm for distilling knowledge from videos, called VITO, yields general representations that far outperform prior video pretraining methods on image understanding tasks, and image pretraining methods on video understanding tasks. Moreover, VITO representations are significantly more robust to natural and synthetic deformations than image-, video-, and adversarially-trained ones. Finally, VITO's predictions are strongly aligned with human judgements, surpassing models that were specifically trained for that purpose. Together, these results suggest that video pretraining could be a simple way of learning unified, robust, and human-aligned representations of the visual world.

Nikhil Parthasarathy, S. M. Ali Eslami, João Carreira, Olivier J. Hénaff • 2022
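
The abstract describes a contrastive framework that learns from the transformations found in videos, but it does not spell out the objective. As a rough illustration only, the sketch below implements a generic InfoNCE-style contrastive loss in plain Python. It assumes that each anchor and its positive are embeddings of two frames from the same video, and that the other positives in the batch act as negatives. The function names, the temperature value, and the pairing scheme are assumptions made for illustration, not VITO's actual formulation.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss over a batch of (anchor, positive) embedding pairs.

    Assumption: anchors[i] and positives[i] embed two frames of the same
    video; every positives[j] with j != i is treated as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    total = 0.0
    for i, ai in enumerate(a):
        # Temperature-scaled cosine similarity of anchor i to every candidate.
        logits = [sum(x * y for x, y in zip(ai, pj)) / temperature for pj in p]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_sum_exp = m + math.log(sum(math.exp(l - m) for l in logits))
        # Cross-entropy with the matching positive as the correct class.
        total += log_sum_exp - logits[i]
    return total / len(a)


if __name__ == "__main__":
    frames_t0 = [[1.0, 0.0], [0.0, 1.0]]
    frames_t1 = [[0.9, 0.1], [0.1, 0.9]]
    print(info_nce(frames_t0, frames_t1))
```

The loss is small when each anchor is most similar to its own positive and grows when the pairing is broken, which is the property a contrastive objective relies on.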

Related benchmarks

Task	Dataset	Result
Semantic segmentation	ADE20K (val)	mIoU 39.4	3089
Video Object Segmentation	DAVIS 2017 (val)	J mean 65.5	1251
Semantic segmentation	ADE20K	mIoU 39.4	1028
Object Detection	COCO (val)	mAP 44	637
Action Recognition	UCF101 (test)	--	376
Object Detection	LVIS (val)	mAP 25.7	174
Object Detection	COCO	mAP 44	137
Video segmentation	DAVIS	J&F Score 68.2	53
Action Recognition	UCF101	Top-1 Acc 87.4	48
Semantic segmentation	PASCAL (train)	mIoU 76.3	11

Showing 10 of 14 rows
