
Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language

About

Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. We do not encode masked tokens, use a fast convolutional decoder and amortize the effort to build teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders in 16.4x lower pre-training time, on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time, and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
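The efficiency ideas in the abstract can be sketched in a toy training step: the teacher builds contextualized targets once per sample, that cost is amortized over several masked versions, the student never encodes masked tokens, and a lightweight decoder regresses the teacher targets at masked positions. The sketch below is a minimal illustration under stated assumptions: the linear-plus-tanh layers and mean-pool decoder are stand-ins for the actual Transformer encoders and convolutional decoder, and all names (`training_step`, `W_teacher`, etc.) are hypothetical, not from the paper's codebase.

```python
import numpy as np

rng = np.random.default_rng(0)

D = 16   # embedding dimension (toy size)
T = 12   # sequence length, e.g. number of image patches
M = 4    # masked versions that share one teacher pass ("multi-masking")

# Toy stand-ins for the real networks.
W_teacher = rng.normal(size=(D, D)) / np.sqrt(D)
W_student = rng.normal(size=(D, D)) / np.sqrt(D)
W_decoder = rng.normal(size=(D, D)) / np.sqrt(D)

def training_step(x):
    """One self-supervised step in the style of data2vec 2.0 (sketch)."""
    # 1) Teacher builds contextualized targets ONCE for the full input;
    #    this cost is amortized across the M masked versions below.
    targets = np.tanh(x @ W_teacher)                  # (T, D)

    total_loss = 0.0
    for _ in range(M):
        # 2) Sample a mask; the student encodes ONLY visible tokens
        #    (masked tokens are never fed to the encoder).
        mask = rng.random(T) < 0.6                    # True = masked
        mask[0] = False   # keep at least one token visible (simplification)
        mask[1] = True    # keep at least one token masked (simplification)
        visible = np.tanh(x[~mask] @ W_student)       # (num_visible, D)

        # 3) A fast decoder maps visible encodings back to a full-length
        #    sequence (mean-pool broadcast stands in for the conv decoder).
        pooled = visible.mean(axis=0)
        pred = np.tile(np.tanh(pooled @ W_decoder), (T, 1))

        # 4) Regress the teacher's contextualized targets at masked positions.
        total_loss += np.mean((pred[mask] - targets[mask]) ** 2)

    return total_loss / M

x = rng.normal(size=(T, D))
loss = training_step(x)
print(f"loss: {loss:.4f}")
```

The point of the structure, rather than the toy layers, is where the savings come from: the teacher forward pass is computed once per sample instead of once per mask, and the student's cost scales with the number of visible tokens only.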

Alexei Baevski, Arun Babu, Wei-Ning Hsu, Michael Auli • 2022

Related benchmarks

Task                            Dataset                   Metric            Result   Rank
Semantic Segmentation           ADE20K (val)              mIoU              54.4     2731
Semantic Segmentation           ADE20K                    mIoU              42.4     936
Natural Language Understanding  GLUE (dev)                SST-2 (Acc)       92.9     504
Image Classification            ImageNet-1K 1.0 (test)    Top-1 Accuracy    86.6     197
Image Classification            iNaturalist 2018 (test)   Top-1 Accuracy    81       192
Image Classification            ImageNet-1K               Accuracy          86.6     190
Image Classification            iNaturalist 18            Overall Accuracy  81       125
Image Classification            VTAB-6                    Accuracy          83.1     29
Image Classification            ImageNet-1K 1.0 (val)     1-shot Acc        24.1     25
Speech Emotion Recognition      SAVEE                     WA                78.59    23

(Showing 10 of 28 rows.)
