
MAE-AST: Masked Autoencoding Audio Spectrogram Transformer

About

In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input, and a shallow decoder operates on encoder outputs and mask tokens. We find that MAE-like pretraining can provide a 3x speedup and a 2x reduction in memory usage over the vanilla SSAST using current audio pretraining strategies with ordinary model and input sizes. When fine-tuning on downstream tasks, which uses only the encoder, we find that our approach outperforms the SSAST on a variety of downstream tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore differences in MAE-style pretraining between the visual and audio domains.
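The key efficiency idea described above (encoder sees only unmasked patches; a shallow decoder reassembles the full sequence with mask tokens and is trained to reconstruct the masked positions) can be sketched as follows. This is a minimal illustration with numpy stand-ins for the transformer encoder and decoder, using hypothetical shapes and a shared zero mask token; it is not the paper's actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def mae_style_forward(patches, mask_ratio=0.75):
    """Sketch of MAE-style pretraining on spectrogram patches.

    patches: (num_patches, d_model) array of patch embeddings.
    Returns the reconstruction loss on masked positions and the
    number of patches the encoder actually processes.
    """
    n, d_model = patches.shape
    n_keep = int(n * (1 - mask_ratio))

    # Randomly partition patch indices into kept (visible) and masked.
    perm = rng.permutation(n)
    keep_idx, mask_idx = perm[:n_keep], perm[n_keep:]

    # "Deep encoder": runs only on the unmasked patches, so self-attention
    # cost scales with n_keep, not n (identity stand-in here).
    encoded = patches[keep_idx].copy()

    # Reassemble the full-length sequence: encoder outputs at visible
    # positions, a shared (here: zero) mask token at masked positions.
    mask_token = np.zeros(d_model)
    full = np.empty_like(patches)
    full[keep_idx] = encoded
    full[mask_idx] = mask_token

    # "Shallow decoder": operates on the full sequence (identity stand-in).
    decoded = full

    # Reconstruction loss is computed only on the masked positions.
    loss = np.mean((decoded[mask_idx] - patches[mask_idx]) ** 2)
    return loss, n_keep

patches = rng.standard_normal((16, 8))
loss, n_keep = mae_style_forward(patches)
print(n_keep)  # at a 75% mask ratio, only 4 of 16 patches reach the encoder
```

With a 75% mask ratio, the quadratic self-attention in the deep encoder runs over only a quarter of the sequence, which is the source of the reported speedup and memory savings; at fine-tuning time the decoder is discarded and the encoder sees the full, unmasked input.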

Alan Baade, Puyuan Peng, David Harwath · 2022

Related benchmarks

Task                                Dataset              Metric     Result   Rank
Audio Classification                ESC-50               Accuracy   90       325
Audio Classification                AudioSet 20K         mAP        30.6     128
Audio Classification                UrbanSound8K         Accuracy   81.3     116
Musical Instrument Classification   NSynth               Accuracy   71.2     75
Audio Classification                SPC V2               Accuracy   97.9     65
Audio Classification                ESC50                Top-1 Acc  90       64
Keyword Spotting                    Speech Commands V2   Accuracy   98       61
Environmental Sound Classification  FSD50K               mAP        41.1     60
Speaker Identification              VoxCeleb1            Accuracy   63.3     58
Classification                      AudioSet (test)      mAP        30.6     57

(Showing 10 of 26 rows)
