
Robust Audio-Visual Instance Discrimination

About

We present a self-supervised learning method to learn audio and video representations. Prior work uses the natural correspondence between audio and video to define a standard cross-modal instance discrimination task, where a model is trained to match representations from the two modalities. However, the standard approach introduces two sources of training noise. First, audio-visual correspondences often produce faulty positives, since the audio and video signals can be uninformative about each other. To limit the detrimental impact of faulty positives, we optimize a weighted contrastive learning loss, which down-weights their contribution to the overall loss. Second, since self-supervised contrastive learning relies on random sampling of negative instances, instances that are semantically similar to the base instance can be used as faulty negatives. To alleviate the impact of faulty negatives, we propose to optimize an instance discrimination loss with a soft target distribution that estimates relationships between instances. We validate our contributions through extensive experiments on action recognition tasks and show that they address the problems of audio-visual instance discrimination and improve transfer learning performance.
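The two loss modifications described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `weighted_soft_xid_loss`, the temperature value, and the choice to score only the video-to-audio direction are assumptions made for illustration. The abstract does not say how the per-instance weights or the soft target distribution are estimated, so both are taken here as inputs.

```python
import numpy as np

def log_softmax(logits):
    # Numerically stable row-wise log-softmax.
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def weighted_soft_xid_loss(video, audio, weights=None, soft_targets=None,
                           temperature=0.07):
    """Hypothetical sketch of a weighted, soft-target cross-modal loss.

    video, audio : (N, D) embeddings; row i of each comes from the same clip.
    weights      : (N,) per-instance weights; a small weight down-weights a
                   suspected faulty positive. Defaults to uniform (assumption).
    soft_targets : (N, N) row-stochastic matrix; row i spreads target mass
                   over instances believed to be related to instance i.
                   Defaults to the identity, i.e. hard one-hot targets.
    temperature  : softmax temperature (0.07 is an assumed value).
    """
    # L2-normalize so the dot product is a cosine similarity.
    video = video / np.linalg.norm(video, axis=1, keepdims=True)
    audio = audio / np.linalg.norm(audio, axis=1, keepdims=True)
    n = video.shape[0]

    # Video-to-audio similarity logits between every pair in the batch.
    logits = video @ audio.T / temperature

    targets = np.eye(n) if soft_targets is None else soft_targets
    weights = np.ones(n) if weights is None else weights

    # Cross-entropy between each target row and the predicted distribution.
    per_instance = -(targets * log_softmax(logits)).sum(axis=1)

    # Weighted average: low-weight instances contribute less to the loss.
    return float((weights * per_instance).sum() / weights.sum())
```

With uniform weights and identity targets, this reduces to the standard cross-modal instance discrimination loss in one direction; a symmetric version would add the audio-to-video term computed from the transposed logits.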

Pedro Morgado, Ishan Misra, Nuno Vasconcelos • 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
Video Action Recognition | Kinetics 400 (val) | Top-1 Acc | 48.9 | 166
Audio Classification | ESC50 | Top-1 Acc | 89.2 | 64
Video Action Recognition | HMDB51 (avg over all splits) | Top-1 Acc | 64.7 | 56
Action Recognition | UCF101 1 (test) | Accuracy | 85.6 | 50
Video Action Recognition | UCF101 avg over all splits | Top-1 Accuracy | 91.5 | 42
Action Recognition | HMDB51 1 (test) | Top-1 Accuracy | 55 | 40
Driver distraction detection | Drive&Act Skeleton | Avg Balanced Accuracy | 39.86 | 9
Driver distraction detection | Drive&Act IR | Average Balanced Accuracy | 55.51 | 9
Driver distraction detection | Drive&Act Depth | Average Balanced Accuracy | 50.89 | 9
Driver distraction detection | Drive&Act Average of unseen IR views | Average Balanced Accuracy | 33 | 6
Showing 10 of 16 rows
