
Is ImageNet worth 1 video? Learning strong image encoders from 1 long unlabelled video

About

Self-supervised learning has unlocked the potential of scaling up pretraining to billions of images, since annotation is unnecessary. But are we making the best use of data? How much more economical can we be? In this work, we attempt to answer this question by making two contributions. First, we investigate first-person videos and introduce a "Walking Tours" dataset. These videos are high-resolution, hours-long, and captured in a single uninterrupted take, depicting a large number of objects and actions with natural scene transitions. They are unlabeled and uncurated, and thus realistic for self-supervision and comparable with human learning. Second, we introduce a novel self-supervised image pretraining method tailored for learning from continuous videos. Existing methods typically adapt image-based pretraining approaches to incorporate more frames. Instead, we advocate a "tracking to learn to recognize" approach. Our method, called DoRA, leads to attention maps that Discover and tRAck objects over time in an end-to-end manner, using transformer cross-attention. We derive multiple views from the tracks and use them in a classical self-supervised distillation loss. Using our novel approach, a single Walking Tours video remarkably becomes a strong competitor to ImageNet for several image and video downstream tasks.
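
The abstract does not spell out the "classical self-supervised distillation loss". As a rough illustration only, the sketch below assumes a DINO-style student-teacher cross-entropy between the outputs for two views of the same tracked object. The function names, temperature values, and toy logits are hypothetical and are not taken from the paper; teacher centering and the momentum update used in such methods are omitted.

```python
import math


def softmax(logits, temperature):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits, temperature):
    """Temperature-scaled log-softmax computed with the log-sum-exp trick."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    log_z = m + math.log(sum(math.exp(x - m) for x in scaled))
    return [x - log_z for x in scaled]


def distillation_loss(student_logits, teacher_logits,
                      student_temp=0.1, teacher_temp=0.04):
    """Cross-entropy between the teacher and student output distributions.

    Assumption: each argument holds the network output for a different view
    of the same tracked object. The temperatures are illustrative defaults.
    """
    p_teacher = softmax(teacher_logits, teacher_temp)   # sharpened target
    log_p_student = log_softmax(student_logits, student_temp)
    return -sum(t * ls for t, ls in zip(p_teacher, log_p_student))


# Toy outputs for two views (hypothetical values).
teacher_view = [2.0, 0.5, -1.0]
agreeing_student = [2.0, 0.5, -1.0]
disagreeing_student = [-1.0, 0.5, 2.0]

loss_agree = distillation_loss(agreeing_student, teacher_view)
loss_disagree = distillation_loss(disagreeing_student, teacher_view)
```

In this toy setup the loss is small when the student ranks the outputs the same way as the teacher and large when it does not, which is the signal that pushes the two views of a tracked object toward a shared representation.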

Shashanka Venkataramanan, Mamshad Nayeem Rizve, João Carreira, Yuki M. Asano, Yannis Avrithis • 2023

Related benchmarks

Task                           Dataset               Result                              Rank
Object Detection               COCO 2017 (val)       AP 23.2                             2643
Instance Segmentation          COCO 2017 (val)       --                                  1201
Video Object Segmentation      DAVIS 2017 (val)      J mean 51.9                         1193
Image Classification           ImageNet-1K           --                                  600
Visual Object Tracking         TrackingNet (test)    Normalized Precision (Pnorm) 82.5   463
Object Tracking                LaSoT                 AUC 61.7                            411
Visual Object Tracking         GOT-10k (test)        Average Overlap 63.8                408
Video Object Segmentation      DAVIS 2017            Jaccard Index (J) 51.9              82
Image Classification           ImageNet-1k (val)     Accuracy 34.8                       59
Unsupervised Object Discovery  PASCAL VOC 2012       CorLoc 24.1                         42
Showing 10 of 17 rows
