
Self-Supervised Learning of Audio-Visual Objects from Video

About

Our objective is to transform a video into a set of discrete audio-visual objects using self-supervised learning. To this end, we introduce a model that uses attention to localize and group sound sources, and optical flow to aggregate information over time. We demonstrate the effectiveness of the audio-visual object embeddings that our model learns by using them for four downstream speech-oriented tasks: (a) multi-speaker sound source separation, (b) localizing and tracking speakers, (c) correcting misaligned audio-visual data, and (d) active speaker detection. Using our representation, these tasks can be solved entirely by training on unlabeled video, without the aid of object detectors. We also demonstrate the generality of our method by applying it to non-human speakers, including cartoons and puppets. Our model significantly outperforms other self-supervised approaches, and obtains performance competitive with methods that use supervised face detection.
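The core localization idea can be illustrated with a minimal sketch: compare an audio embedding against a grid of per-location visual embeddings via cosine similarity, and read off the highest-scoring cell as the likely sound source. This is an illustrative toy, not the authors' implementation; the function name, feature shapes, and use of raw NumPy (rather than a learned network) are all assumptions for the example.

```python
import numpy as np

def audio_visual_attention(visual_feats, audio_feat, eps=1e-8):
    """Cosine-similarity attention between one audio embedding and a
    grid of visual embeddings; high scores mark likely sound sources.

    visual_feats: (H, W, D) per-location visual embeddings
    audio_feat:   (D,) audio embedding for the same clip
    Returns an (H, W) attention map with values in [-1, 1].
    """
    v = visual_feats / (np.linalg.norm(visual_feats, axis=-1, keepdims=True) + eps)
    a = audio_feat / (np.linalg.norm(audio_feat) + eps)
    return v @ a  # (H, W) map of cosine similarities

# Toy example: a 4x4 feature grid where one cell matches the audio.
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 4, 16))
audio = feats[2, 1].copy()  # pretend the sound comes from cell (2, 1)
attn = audio_visual_attention(feats, audio)
y, x = np.unravel_index(np.argmax(attn), attn.shape)
print((y, x))  # the matching cell scores highest
```

In the full model the embeddings are produced by trained audio and visual networks and the attention maps are further aggregated over time with optical flow; this sketch only shows the similarity-based localization step.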

Triantafyllos Afouras, Andrew Owens, Joon Son Chung, Andrew Zisserman • 2020

Related benchmarks

Task | Dataset | Result | Rank
Audio-visual speech separation | LRS2-2Mix (test) | SI-SNRi 10.8 | 33
Audio-visual speech separation | LRS3 (test) | SDRi 7.71 | 20
Visual Sound Source Localization | VGG-SS (test) | LocAcc 29.7 | 19
Audio-visual speech separation | LRS2 (test) | SDRi 6.88 | 14
Speech Separation | VoxCeleb2-2Mix (test) | SDRi 4.8 | 12
Sound Source Localization | VGGSound Source | cIoU 29.7 | 9
Active Speaker Detection | Columbia dataset | Weighted F1 (Bell) 82.4 | 9
Active Speaker Detection | Columbia | F1 (Bell) 0.926 | 7
Speaker Separation | LRS2 synthetic (test) | SDR 8.86 | 7
Speaker Separation | LRS3 synthetic (test) | SDR 9.72 | 7
(10 of 12 benchmark results shown.)
