
Revisiting Feature Prediction for Learning Visual Representations from Video

About

This paper explores feature prediction as a stand-alone objective for unsupervised learning from video and introduces V-JEPA, a collection of vision models trained solely with a feature prediction objective, without pretrained image encoders, text, negative examples, reconstruction, or other sources of supervision. The models are trained on 2 million videos collected from public datasets and are evaluated on downstream image and video tasks. Our results show that learning by predicting video features leads to versatile visual representations that perform well on both motion- and appearance-based tasks, without adaptation of the model's parameters, i.e., using a frozen backbone. Our largest model, a ViT-H/16 trained only on videos, obtains 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K.
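The core idea — predicting the features of masked video patches from visible ones, rather than reconstructing pixels — can be sketched as below. This is a toy illustration, not the paper's implementation: the dimensions, linear-map "encoders", mean-pooled context, and EMA factor are all assumptions standing in for V-JEPA's ViT context encoder, EMA target encoder, and transformer predictor.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (assumed for illustration; V-JEPA operates on ViT patch tokens).
num_patches, dim = 16, 8
mask = np.zeros(num_patches, dtype=bool)
mask[: num_patches // 2] = True            # patches hidden from the context encoder

x = rng.standard_normal((num_patches, dim))  # patch embeddings of one clip

# Stand-ins for the context encoder, its EMA target copy, and the predictor.
W_ctx = rng.standard_normal((dim, dim))
W_tgt = 0.99 * W_ctx                         # EMA of context weights (no gradient)
W_pred = rng.standard_normal((dim, dim))

ctx_feats = x[~mask] @ W_ctx                 # encode only the visible patches
tgt_feats = (x @ W_tgt)[mask]                # targets: features of the masked patches

# Predictor maps the pooled context to each masked patch's target feature.
pred = np.tile(ctx_feats.mean(axis=0) @ W_pred, (mask.sum(), 1))

# Regression loss in feature space: no pixel reconstruction, no negatives.
loss = float(np.abs(pred - tgt_feats).mean())
```

Because the loss is computed between feature vectors, the model is free to discard unpredictable low-level detail, which is one motivation the paper gives for feature prediction over reconstruction.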

Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, Nicolas Ballas · 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Image Classification | ImageNet-1K | Top-1 Acc | 75.9 | 836 |
| Video Action Classification | Something-Something v2 | Top-1 Acc | 74.3 | 139 |
| Action Recognition | SSV2 | Top-1 Acc | 71.4 | 93 |
| Action Recognition | Diving-48 | Top-1 Acc | 87.9 | 82 |
| Video Action Classification | Diving-48 | Top-1 Acc | 87.9 | 53 |
| Video Action Classification | Kinetics-400 | Top-1 Accuracy | 0.845 | 48 |
| Object Classification | ImageNet-1K | Top-1 Acc | 80 | 33 |
| Video Action Classification | COIN | Top-1 Acc | 87.1 | 33 |
| Action Recognition | K400 | Top-1 Accuracy | 82 | 16 |
| Detecting physically implausible events | IntPhys2 | Permanence (Fixed) Win Rate | 0.5962 | 13 |

Showing 10 of 18 rows
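The benchmark numbers above come from frozen-backbone evaluation: the pretrained encoder is fixed and only a lightweight classifier is fit on its features. A minimal sketch of that protocol, using random stand-in features and a least-squares linear probe (the paper itself uses an attentive probe; everything here is an assumed simplification):

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy setup: features from a frozen encoder, random labels.
n_train, n_test, dim, n_classes = 200, 50, 16, 4
feats_train = rng.standard_normal((n_train, dim))   # frozen encoder outputs
labels_train = rng.integers(0, n_classes, n_train)
feats_test = rng.standard_normal((n_test, dim))
labels_test = rng.integers(0, n_classes, n_test)

# Fit only the probe: least-squares regression onto one-hot labels.
onehot = np.eye(n_classes)[labels_train]
W, *_ = np.linalg.lstsq(feats_train, onehot, rcond=None)

# Top-1 accuracy of the probe on held-out clips; the backbone never updates.
top1 = float(((feats_test @ W).argmax(axis=1) == labels_test).mean())
```

Because the backbone is never fine-tuned, the same frozen representation is reused across all of the image and video tasks in the table, which is the comparison the paper emphasizes.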
