
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

About

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input, and a shallow decoder operates on encoder outputs and mask tokens. We find that MAE-like pretraining can provide a 3x speedup and 2x memory usage reduction over the vanilla SSAST using current audio pretraining strategies with ordinary model and input sizes. During fine-tuning, which uses only the encoder, our approach outperforms the SSAST on a variety of downstream tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore differences in MAE-style pretraining between the visual and audio domains.
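The masking scheme described above can be sketched in a few lines. This is a minimal NumPy illustration of the bookkeeping, not the authors' implementation: the function names (`mae_style_split`, `assemble_decoder_input`) and the zero mask token are illustrative assumptions. The key point is that the deep encoder processes only the 25% of patches that survive masking, while the shallow decoder receives the full-length sequence with mask tokens filling the gaps.

```python
import numpy as np

def mae_style_split(patches, mask_ratio=0.75, seed=0):
    """Randomly partition the patch sequence; the encoder sees only the kept part.

    Illustrative helper (not from the paper): returns the visible patches
    plus the index sets needed to reassemble the sequence later.
    """
    n = patches.shape[0]
    n_keep = int(n * (1 - mask_ratio))
    perm = np.random.default_rng(seed).permutation(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]
    return patches[keep_idx], keep_idx, mask_idx

def assemble_decoder_input(enc_out, keep_idx, mask_idx, mask_token):
    """Build the shallow decoder's input: encoder outputs scattered back to
    their original positions, with a shared mask token at masked positions."""
    n = len(keep_idx) + len(mask_idx)
    dec_in = np.tile(mask_token, (n, 1))  # start with mask tokens everywhere
    dec_in[keep_idx] = enc_out            # place encoder outputs at kept slots
    return dec_in

# 512 spectrogram patches of dim 768; at a 75% mask ratio the deep
# encoder attends over only 128 tokens -- the source of the speedup.
patches = np.random.default_rng(1).normal(size=(512, 768))
visible, keep_idx, mask_idx = mae_style_split(patches)
enc_out = visible  # stand-in for the encoder forward pass
dec_in = assemble_decoder_input(enc_out, keep_idx, mask_idx, np.zeros(768))
```

Because self-attention cost scales quadratically with sequence length, shrinking the encoder's input from 512 to 128 tokens is what yields the reported speedup and memory savings; the shallow decoder, which does see the full sequence, is cheap and is discarded at fine-tuning time.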

Alan Baade, Puyuan Peng, David Harwath • 2022

Related benchmarks

Task                                 Dataset              Metric     Result  Rank
Audio Classification                 ESC-50               Accuracy   90      374
Audio Classification                 AudioSet 20K         mAP        30.6    128
Audio Classification                 UrbanSound8K         Accuracy   81.3    126
Musical Instrument Classification    NSynth               Accuracy   71.2    106
Environmental Sound Classification   FSD50K               mAP        41.1    91
Audio Classification                 SPC V2               Accuracy   97.9    65
Audio Classification                 ESC50                Top-1 Acc  90      64
Keyword Spotting                     Speech Commands V2   Accuracy   98      61
Audio Classification                 GTZAN                Accuracy   64.1    59
Speaker Identification               VoxCeleb1            Accuracy   63.3    58

Showing 10 of 26 rows.
