Efficient Self-supervised Learning with Contextualized Target Representations for Vision, Speech and Language
About
Current self-supervised learning algorithms are often modality-specific and require large amounts of computational resources. To address these issues, we increase the training efficiency of data2vec, a learning objective that generalizes across several modalities. Specifically, we do not encode masked tokens, we use a fast convolutional decoder, and we amortize the effort of building teacher representations. data2vec 2.0 benefits from the rich contextualized target representations introduced in data2vec, which enable a fast self-supervised learner. Experiments on ImageNet-1K image classification show that data2vec 2.0 matches the accuracy of Masked Autoencoders with 16.4x lower pre-training time; on Librispeech speech recognition it performs as well as wav2vec 2.0 in 10.6x less time; and on GLUE natural language understanding it matches a retrained RoBERTa model in half the time. Trading some speed for accuracy results in ImageNet-1K top-1 accuracy of 86.8% with a ViT-L model trained for 150 epochs.
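Two of the efficiency ideas named above, skipping masked tokens in the encoder and amortizing the teacher computation, can be illustrated with a toy sketch. This is not the authors' implementation: the "teacher" here is a stand-in arithmetic function, and the names `teacher_targets`, `sample_mask`, `training_step`, `num_masks` and `mask_ratio` are hypothetical. It assumes that amortization means computing the teacher targets once per sample and reusing them across several differently masked views of that sample.

```python
import random

def teacher_targets(tokens):
    # Stand-in for the teacher network (assumption, not the real model):
    # each target mixes the token with a summary of the whole sequence,
    # mimicking a "contextualized" target representation.
    context = sum(tokens) / len(tokens)
    return [t + context for t in tokens]

def sample_mask(n, mask_ratio, rng):
    # Choose which positions are hidden from the student.
    k = int(n * mask_ratio)
    return set(rng.sample(range(n), k))

def training_step(tokens, num_masks=4, mask_ratio=0.5, seed=0):
    rng = random.Random(seed)
    # Teacher pass runs ONCE per sample ...
    targets = teacher_targets(tokens)
    views = []
    # ... and its output is reused for every masked view (amortization).
    for _ in range(num_masks):
        masked = sample_mask(len(tokens), mask_ratio, rng)
        # Only unmasked tokens would be fed to the student encoder.
        visible = [(i, t) for i, t in enumerate(tokens) if i not in masked]
        # The student is asked to predict targets at the masked positions.
        view_targets = {i: targets[i] for i in masked}
        views.append((visible, view_targets))
    return targets, views

if __name__ == "__main__":
    targets, views = training_step([1.0, 2.0, 3.0, 4.0], num_masks=3)
    for visible, view_targets in views:
        print(len(visible), "visible tokens;", len(view_targets), "targets")
```

In this sketch the encoder input shrinks with the mask ratio, and the single teacher pass is shared by all `num_masks` views, which is where the savings in the toy setting come from.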
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 54.4 | 2731 |
| Semantic segmentation | ADE20K | mIoU | 42.4 | 936 |
| Natural Language Understanding | GLUE (dev) | SST-2 (Acc) | 92.9 | 504 |
| Image Classification | ImageNet-1k 1.0 (test) | Top-1 Accuracy | 86.6 | 197 |
| Image Classification | iNaturalist 2018 (test) | Top-1 Accuracy | 81 | 192 |
| Image Classification | ImageNet-1K | Accuracy | 86.6 | 190 |
| Image Classification | iNaturalist 18 | Overall Accuracy | 81 | 125 |
| Image Classification | VTAB-6 | Accuracy | 83.1 | 29 |
| Image Classification | ImageNet-1K 1.0 (val) | 1-shot Acc | 24.1 | 25 |
| Speech Emotion Recognition | SAVEE | WA | 78.59 | 23 |