Masked Autoencoders that Listen

About

This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-supervised representation learning from audio spectrograms. Following the Transformer encoder-decoder design in MAE, our Audio-MAE first encodes audio spectrogram patches with a high masking ratio, feeding only the non-masked tokens through encoder layers. The decoder then re-orders and decodes the encoded context padded with mask tokens, in order to reconstruct the input spectrogram. We find it beneficial to incorporate local window attention in the decoder, as audio spectrograms are highly correlated in local time and frequency bands. We then fine-tune the encoder with a lower masking ratio on target datasets. Empirically, Audio-MAE sets new state-of-the-art performance on six audio and speech classification tasks, outperforming other recent models that use external supervised pre-training. The code and models will be at https://github.com/facebookresearch/AudioMAE.
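The masking and re-ordering steps described in the abstract can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (patchify, random_mask, pad_and_restore), the 1024x128 spectrogram size, the 16x16 patch size, the 0.8 masking ratio, and the all-zero mask token are all hypothetical choices made for the example, and the encoder is replaced by an identity stand-in.

```python
import numpy as np


def patchify(spec, patch=16):
    """Split a (time, freq) spectrogram into non-overlapping patch x patch tokens."""
    t, f = spec.shape
    assert t % patch == 0 and f % patch == 0, "dims must be multiples of patch"
    gt, gf = t // patch, f // patch
    x = spec.reshape(gt, patch, gf, patch).transpose(0, 2, 1, 3)
    return x.reshape(gt * gf, patch * patch)


def random_mask(tokens, mask_ratio, rng):
    """Keep a random subset of tokens; return them plus bookkeeping for the decoder."""
    n = tokens.shape[0]
    n_keep = int(round(n * (1.0 - mask_ratio)))
    perm = rng.permutation(n)          # shuffled order of the original indices
    keep_idx = perm[:n_keep]           # the first n_keep shuffled tokens stay visible
    restore_idx = np.argsort(perm)     # inverse permutation, used to undo the shuffle
    mask = np.ones(n, dtype=bool)      # True marks a masked (dropped) token
    mask[keep_idx] = False
    return tokens[keep_idx], mask, restore_idx


def pad_and_restore(encoded, restore_idx, mask_token):
    """Append mask tokens to the encoded context and put tokens back in input order."""
    n_mask = restore_idx.shape[0] - encoded.shape[0]
    padded = np.concatenate([encoded, np.tile(mask_token, (n_mask, 1))], axis=0)
    return padded[restore_idx]


rng = np.random.default_rng(0)
spec = rng.standard_normal((1024, 128))        # assumed spectrogram size
tokens = patchify(spec)                        # 512 tokens of dimension 256
visible, mask, restore_idx = random_mask(tokens, mask_ratio=0.8, rng=rng)
encoded = visible                              # identity stand-in for the encoder
decoder_input = pad_and_restore(encoded, restore_idx, np.zeros(tokens.shape[1]))
print(visible.shape, decoder_input.shape)
```

Only the visible tokens pass through the encoder, so the encoder cost scales with the kept fraction rather than the full sequence. The decoder input has one row per original patch, with mask tokens standing in at the masked positions.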

Po-Yao Huang, Hu Xu, Juncheng Li, Alexei Baevski, Michael Auli, Wojciech Galuba, Florian Metze, Christoph Feichtenhofer • 2022

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy 97.4	441
Audio Classification	AudioSet 20K	mAP 37.6	147
Audio Classification	AudioSet 2M	mAP 47.4	98
Audio Classification	SPC V2	Accuracy 98.3	65
Audio Classification	ESC50	Top-1 Acc 93.6	64
Keyword Spotting	Speech Commands V2	Accuracy 98.3	61
Speaker Identification	VoxCeleb1	Accuracy 94.8	58
Classification	AudioSet (test)	mAP 47.3	57
Audio Classification	Speech Commands V2 (test)	Accuracy 98.3	46
Audio Event Tagging	AudioSet AS-2M (full)	mAP 47.3	45

Showing 10 of 54 rows
