
Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization

About

There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients for obtaining powerful multi-sensory representations from models optimized to discern the temporal synchronization of audio-video pairs. Without further finetuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization for improving the accuracy of video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
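The training signal described above is a contrastive loss over audio-video pairs: embeddings of synchronized pairs are pulled together, while out-of-sync negatives are pushed apart by a margin. Below is a minimal sketch of such a loss in numpy; the function name, margin value, and Euclidean-distance formulation are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

def contrastive_sync_loss(audio_emb, video_emb, in_sync, margin=0.99):
    """Illustrative contrastive loss for audio-video synchronization.

    audio_emb, video_emb: (N, D) arrays of clip embeddings.
    in_sync: (N,) array with 1 for synchronized pairs, 0 for negatives.
    margin: hypothetical margin separating out-of-sync pairs.
    """
    # Euclidean distance between each audio/video embedding pair.
    d = np.linalg.norm(audio_emb - video_emb, axis=1)
    # Synchronized pairs: penalize any distance (pull together).
    pos = in_sync * d ** 2
    # Out-of-sync pairs: penalize only distances below the margin (push apart).
    neg = (1 - in_sync) * np.maximum(margin - d, 0.0) ** 2
    return float(np.mean(pos + neg))
```

In the paper, the choice of negatives matters: out-of-sync clips drawn from the *same* video are harder than clips from different videos, and the curriculum introduces such hard negatives gradually. The sketch above is agnostic to how negatives are sampled.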

Bruno Korbar, Du Tran, Lorenzo Torresani • 2018

Related benchmarks

| Task                 | Dataset                            | Result         | Rank |
|----------------------|------------------------------------|----------------|------|
| Action Recognition   | UCF101                             | –              | 365  |
| Action Recognition   | UCF101 (mean of 3 splits)          | Accuracy: 89   | 357  |
| Audio Classification | ESC-50                             | Accuracy: 82.3 | 325  |
| Action Recognition   | UCF101 (test)                      | Accuracy: 90.5 | 307  |
| Action Recognition   | HMDB51 (test)                      | Accuracy: 0.668 | 249 |
| Action Recognition   | HMDB51                             | Top-1 Acc: 57.3 | 225 |
| Action Recognition   | HMDB-51 (average of three splits)  | Top-1 Acc: 61.6 | 204 |
| Action Recognition   | UCF101 (3 splits)                  | Accuracy: 89   | 155  |
| Action Recognition   | UCF101 (Split 1)                   | Top-1 Acc: 89  | 105  |
| Audio Classification | ESC50                              | Top-1 Acc: 80.6 | 64  |

Showing 10 of 24 rows.
