Audiovisual Masked Autoencoders
About
Can we leverage the audiovisual information already present in video to improve self-supervised representation learning? To answer this question, we study various pretraining architectures and objectives within the masked autoencoding framework, motivated by the success of similar methods in natural language and image understanding. We show that we can achieve significant improvements on audiovisual downstream classification tasks, surpassing the state-of-the-art on VGGSound and AudioSet. Furthermore, we can leverage our audiovisual pretraining scheme for multiple unimodal downstream tasks using a single audiovisual pretrained model. We additionally demonstrate the transferability of our representations, achieving state-of-the-art audiovisual results on Epic Kitchens without pretraining specifically for this dataset.
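The paragraph above describes joint masked autoencoding over video and audio. As an illustration only, the sketch below shows one way such an objective can be set up in PyTorch: patch tokens from both modalities are randomly masked, the visible tokens are encoded jointly, and a shallow decoder reconstructs the masked patches of each modality. All module names, dimensions, and the masking ratio are assumptions chosen for clarity, not the paper's implementation; positional embeddings are omitted for brevity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def gather_patches(patches, idx):
    """Select per-sample patches at the given indices along the token axis."""
    return patches.gather(1, idx.unsqueeze(-1).expand(-1, -1, patches.size(-1)))


class AudiovisualMAE(nn.Module):
    """Toy audiovisual masked autoencoder (illustrative, not the paper's model):
    one shared encoder over the visible tokens of both modalities, a shallow
    decoder, and per-modality heads that reconstruct the masked patches."""

    def __init__(self, dim=256, heads=8,
                 video_patch_dim=3 * 16 * 16,   # flattened 16x16 RGB patch
                 audio_patch_dim=16 * 16):      # flattened 16x16 spectrogram patch
        super().__init__()
        self.video_embed = nn.Linear(video_patch_dim, dim)
        self.audio_embed = nn.Linear(audio_patch_dim, dim)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=4)
        self.decoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(dim, heads, batch_first=True), num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.video_head = nn.Linear(dim, video_patch_dim)
        self.audio_head = nn.Linear(dim, audio_patch_dim)

    @staticmethod
    def random_mask(tokens, ratio):
        """Keep a random (1 - ratio) fraction of tokens; return the visible
        tokens and the indices of the dropped (masked) ones."""
        b, n, _ = tokens.shape
        n_keep = max(1, int(n * (1 - ratio)))
        order = torch.rand(b, n, device=tokens.device).argsort(dim=1)
        keep_idx, mask_idx = order[:, :n_keep], order[:, n_keep:]
        return gather_patches(tokens, keep_idx), mask_idx

    def forward(self, video_patches, audio_patches, mask_ratio=0.75):
        v = self.video_embed(video_patches)
        a = self.audio_embed(audio_patches)
        v_vis, v_mask_idx = self.random_mask(v, mask_ratio)
        a_vis, a_mask_idx = self.random_mask(a, mask_ratio)

        # Early fusion: a single encoder attends over the visible tokens
        # of both modalities at once.
        latent = self.encoder(torch.cat([v_vis, a_vis], dim=1))

        # Append learned mask tokens for the dropped positions and decode.
        b, n_vm, n_am = v.size(0), v_mask_idx.size(1), a_mask_idx.size(1)
        masks = self.mask_token.expand(b, n_vm + n_am, -1)
        decoded = self.decoder(torch.cat([latent, masks], dim=1))
        v_pred = self.video_head(decoded[:, latent.size(1):latent.size(1) + n_vm])
        a_pred = self.audio_head(decoded[:, latent.size(1) + n_vm:])

        # Reconstruction loss on the masked-out patches of each modality.
        v_tgt = gather_patches(video_patches, v_mask_idx)
        a_tgt = gather_patches(audio_patches, a_mask_idx)
        return F.mse_loss(v_pred, v_tgt) + F.mse_loss(a_pred, a_tgt)


# Usage with dummy data: [batch, num patches, flattened patch dim].
model = AudiovisualMAE()
video = torch.randn(2, 196, 3 * 16 * 16)
audio = torch.randn(2, 128, 16 * 16)
loss = model(video, audio)
loss.backward()
```

The early-fusion encoder shown here is just one of the fusion strategies such work studies; separate per-modality encoders with later fusion are another natural variant under the same masked-reconstruction objective.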
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | EPIC-KITCHENS-100 (test) | Top-1 Verb Accuracy | 71.4 | 101 |
| Audio Classification | AudioSet | mAP | 46.6 | 25 |
| Audio-Visual Classification | VGGSound | Top-1 Accuracy | 65 | 24 |
| Audio-Visual Event Classification | AudioSet 2M | mAP (Audio-only) | 46.6 | 16 |
| Audio-Visual Classification | VGGSound Music | Top-1 Accuracy | 67.61 | 12 |
| Audio-Visual Classification | AudioSet | Top-1 Accuracy | 51.32 | 12 |
| Audiovisual Classification | AudioSet | mAP | 51.8 | 6 |
| Video-only Classification | AudioSet | mAP | 31.1 | 5 |
| Supervised Event Localization | AVE | Audio-only Accuracy | 82.3 | 3 |