MAE-AST: Masked Autoencoding Audio Spectrogram Transformer
About
In this paper, we propose a simple yet powerful improvement over the recent Self-Supervised Audio Spectrogram Transformer (SSAST) model for speech and audio classification. Specifically, we leverage the insight that the SSAST uses a very high masking ratio (75%) during pretraining, meaning that the vast majority of self-attention compute is performed on mask tokens. We address this by integrating the encoder-decoder architecture from Masked Autoencoders Are Scalable Vision Learners (MAE) into the SSAST, where a deep encoder operates only on unmasked input, and a shallow decoder operates on encoder outputs and mask tokens. We find that MAE-like pretraining can provide a 3x speedup and a 2x reduction in memory usage over the vanilla SSAST using current audio pretraining strategies with ordinary model and input sizes. When fine-tuning, which uses only the encoder, our approach outperforms the SSAST on a variety of downstream tasks. We further conduct comprehensive evaluations of different pretraining strategies and explore differences in MAE-style pretraining between the visual and audio domains.
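The compute savings come from the token bookkeeping described above: with a 75% masking ratio, the deep encoder attends over only the 25% of visible patch tokens, and the shallow decoder reinserts a learned mask token at the masked positions before reconstruction. The following NumPy sketch illustrates that bookkeeping; the token count, embedding dimension, and the identity stand-ins for the encoder and mask token are illustrative assumptions, not the paper's actual hyperparameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes: 512 spectrogram patch tokens, 64-dim embeddings;
# the 75% masking ratio matches the paper.
num_tokens, mask_ratio, dim = 512, 0.75, 64
patches = rng.standard_normal((num_tokens, dim))

# Randomly keep 25% of the tokens; only these enter the deep encoder.
num_keep = int(num_tokens * (1 - mask_ratio))
perm = rng.permutation(num_tokens)
keep_idx, mask_idx = perm[:num_keep], perm[num_keep:]

encoder_input = patches[keep_idx]   # (128, 64): 4x fewer tokens than the input
encoded = encoder_input             # stand-in for the deep Transformer encoder

# The shallow decoder sees encoder outputs plus a (here: zero) learned
# mask token at every masked position, restoring the full sequence.
mask_token = np.zeros(dim)
decoder_input = np.empty((num_tokens, dim))
decoder_input[keep_idx] = encoded
decoder_input[mask_idx] = mask_token

# Self-attention cost scales quadratically in sequence length, so the
# deep encoder's attention is ~16x cheaper than attending over all tokens.
print(encoder_input.shape, decoder_input.shape)
```

Because only the encoder is kept for fine-tuning, none of this masking machinery is paid for on downstream tasks.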
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 90 | 325 |
| Audio Classification | AudioSet 20K | mAP | 30.6 | 128 |
| Audio Classification | UrbanSound8K | Accuracy | 81.3 | 116 |
| Musical Instrument Classification | NSynth | Accuracy | 71.2 | 75 |
| Audio Classification | SPC V2 | Accuracy | 97.9 | 65 |
| Audio Classification | ESC-50 | Top-1 Accuracy | 90 | 64 |
| Keyword Spotting | Speech Commands V2 | Accuracy | 98 | 61 |
| Environmental Sound Classification | FSD50K | mAP | 41.1 | 60 |
| Speaker Identification | VoxCeleb1 | Accuracy | 63.3 | 58 |
| Classification | AudioSet (test) | mAP | 30.6 | 57 |