Cooperative Learning of Audio and Video Models from Self-Supervised Synchronization
About
There is a natural correlation between the visual and auditory elements of a video. In this work we leverage this connection to learn general and effective models for both audio and video analysis from self-supervised temporal synchronization. We demonstrate that a calibrated curriculum learning scheme, a careful choice of negative examples, and the use of a contrastive loss are critical ingredients to obtain powerful multi-sensory representations from models optimized to discern the temporal synchronization of audio-video pairs. Without further fine-tuning, the resulting audio features achieve performance superior or comparable to the state of the art on established audio classification benchmarks (DCASE2014 and ESC-50). At the same time, our visual subnet provides a very effective initialization for video-based action recognition models: compared to learning from scratch, our self-supervised pretraining yields a remarkable gain of +19.9% in action recognition accuracy on UCF101 and a boost of +17.7% on HMDB51.
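To make the training objective concrete, below is a minimal sketch of a pairwise contrastive synchronization loss of the kind the abstract describes, written in PyTorch. This is an illustration under stated assumptions, not the authors' implementation: the function name `contrastive_sync_loss`, the margin default, and the embedding dimensions are hypothetical, and the exact loss and hyperparameters are those given in the paper.

```python
import torch
import torch.nn.functional as F

def contrastive_sync_loss(video_emb, audio_emb, is_positive, margin=0.99):
    """Pairwise contrastive loss over audio-video embedding distances.

    video_emb, audio_emb: (N, D) embeddings from the video and audio subnets.
    is_positive: (N,) floats, 1.0 for in-sync pairs, 0.0 for negatives.
    margin: distance margin for negative pairs (hypothetical default).
    """
    dist = F.pairwise_distance(video_emb, audio_emb)  # Euclidean distance per pair
    pos_term = is_positive * dist.pow(2)              # pull in-sync pairs together
    neg_term = (1.0 - is_positive) * F.relu(margin - dist).pow(2)  # push negatives past the margin
    return (pos_term + neg_term).mean()

# Toy usage: eight pairs, half in-sync, half negative.
v = F.normalize(torch.randn(8, 128), dim=1)
a = F.normalize(torch.randn(8, 128), dim=1)
y = torch.tensor([1., 1., 1., 1., 0., 0., 0., 0.])
print(contrastive_sync_loss(v, a, y))
```

The "careful choice of negative examples" and the curriculum mentioned above concern how negative pairs are constructed: negatives may come from a different video (easy) or from temporally misaligned segments of the same video (hard), with the harder negatives introduced only after an initial phase of training on the easier ones.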
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | UCF101 | -- | -- | 365 |
| Action Recognition | UCF101 (mean of 3 splits) | Accuracy | 89 | 357 |
| Audio Classification | ESC-50 | Accuracy | 82.3 | 325 |
| Action Recognition | UCF101 (test) | Accuracy | 90.5 | 307 |
| Action Recognition | HMDB51 (test) | Accuracy | 0.668 | 249 |
| Action Recognition | HMDB51 | Top-1 Acc | 57.3 | 225 |
| Action Recognition | HMDB-51 (average of three splits) | Top-1 Acc | 61.6 | 204 |
| Action Recognition | UCF101 (3 splits) | Accuracy | 89 | 155 |
| Action Recognition | UCF101 (Split 1) | Top-1 Acc | 89 | 105 |
| Audio Classification | ESC50 | Top-1 Acc | 80.6 | 64 |