
Self-Supervised Audio-Visual Representation Learning with Relaxed Cross-Modal Synchronicity

About

We present CrissCross, a self-supervised framework for learning audio-visual representations. The framework introduces a novel idea: in addition to learning intra-modal and standard 'synchronous' cross-modal relations, CrissCross also learns 'asynchronous' cross-modal relations. We perform in-depth studies showing that, by relaxing the temporal synchronicity between the audio and visual modalities, the network learns strong, generalized representations useful for a variety of downstream tasks. To pretrain our proposed solution, we use three datasets of varying sizes: Kinetics-Sound, Kinetics400, and AudioSet. The learned representations are evaluated on a number of downstream tasks, namely action recognition, sound classification, and action retrieval. Our experiments show that CrissCross either outperforms or performs on par with the current state-of-the-art self-supervised methods on action recognition and action retrieval with UCF101 and HMDB51, as well as sound classification with ESC50 and DCASE. Moreover, CrissCross outperforms fully-supervised pretraining when pretrained on Kinetics-Sound. The code and pretrained models are available on the project website.
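The three relation types named in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names `sample_clip_starts` and `build_pairs` are hypothetical, the encoders and training loss are omitted, and sampling the two clip start times independently and uniformly is an assumption made only for illustration (the abstract does not say how the temporal offset is chosen).

```python
import random


def sample_clip_starts(video_len_s, clip_len_s, rng):
    """Pick two clip start times (in seconds) from one video.

    Assumption: the two starts are drawn independently, so they may differ.
    """
    latest = video_len_s - clip_len_s
    if latest < 0:
        raise ValueError("clip is longer than the video")
    return rng.uniform(0.0, latest), rng.uniform(0.0, latest)


def build_pairs(t1, t2):
    """Group the views of two timestamped clips by relation type.

    Each view is (modality, start_time); 'v' = visual, 'a' = audio.
    """
    v1, v2 = ("v", t1), ("v", t2)
    a1, a2 = ("a", t1), ("a", t2)
    return {
        # same modality, different timestamps
        "intra_modal": [(v1, v2), (a1, a2)],
        # different modalities, same timestamp
        "sync_cross_modal": [(v1, a1), (v2, a2)],
        # different modalities, different timestamps
        "async_cross_modal": [(v1, a2), (v2, a1)],
    }


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    t1, t2 = sample_clip_starts(video_len_s=10.0, clip_len_s=2.0, rng=rng)
    for relation, pairs in build_pairs(t1, t2).items():
        print(relation, pairs)
```

In this sketch, relaxing synchronicity amounts to also pairing the visual view at one timestamp with the audio view at the other, instead of only pairing views that share a timestamp.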

Pritam Sarkar, Ali Etemad • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Recognition | Kinetics-400 | Top-1 Acc | 50.1 | 413 |
| Action Recognition | UCF101 | Accuracy | 87.7 | 365 |
| Audio Classification | ESC-50 | Accuracy | 90.5 | 325 |
| Video Action Recognition | Kinetics 400 (val) | Top-1 Acc | 50.1 | 151 |
| Action Recognition | UCF101 (Split 1) | Top-1 Acc | 92.4 | 105 |
| Action Recognition | HMDB51 | Accuracy (HMDB51) | 56.2 | 78 |
| Action Recognition | HMDB51 (split 1) | Top-1 Acc | 67.4 | 75 |
| Audio Classification | ESC50 | Top-1 Acc | 90.5 | 64 |
| Video Action Recognition | HMDB51 (avg over all splits) | Top-1 Acc | 67.4 | 56 |
| Video Action Recognition | UCF101 avg over all splits | Top-1 Accuracy | 92.4 | 42 |
Showing 10 of 14 rows
