
Masked Spectrogram Prediction For Self-Supervised Audio Pre-Training

About

Transformer-based models attain excellent results and generalize well when trained on sufficient data. However, because the data available in the audio domain is limited, most transformer-based models for audio tasks are finetuned from models pre-trained in other domains (e.g., images), which differ notably from the audio domain. Other methods explore self-supervised learning directly in the audio domain but currently do not perform well on downstream tasks. In this paper, we present a novel self-supervised learning method for transformer-based audio models, called masked spectrogram prediction (MaskSpec), to learn powerful audio representations from unlabeled audio data (AudioSet in this paper). Our method masks random patches of the input spectrogram and reconstructs the masked regions with an encoder-decoder architecture. Without using extra model weights or supervision, experimental results on multiple downstream datasets demonstrate that MaskSpec achieves a significant performance gain over supervised methods and outperforms previous pre-trained models. In particular, our best model reaches 0.471 (mAP) on AudioSet, 0.854 (mAP) on OpenMIC2018, 0.982 (accuracy) on ESC-50, 0.976 (accuracy) on SCV2, and 0.823 (accuracy) on DCASE2019 Task1A.
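The masking step described in the abstract can be illustrated with a small, dependency-free sketch. This is not the authors' implementation: the patch size, the 75% mask ratio, the mean-squared-error loss restricted to masked patches, and the helper names `patchify`, `random_mask`, and `masked_mse` are all assumptions made for illustration, and the encoder-decoder is replaced by a stand-in that predicts zeros.

```python
import random


def patchify(spec, patch_h, patch_w):
    """Split a 2-D spectrogram (list of rows) into flattened, non-overlapping patches."""
    n_rows, n_cols = len(spec), len(spec[0])
    if n_rows % patch_h or n_cols % patch_w:
        raise ValueError("spectrogram shape must be divisible by the patch shape")
    patches = []
    for r in range(0, n_rows, patch_h):
        for c in range(0, n_cols, patch_w):
            patches.append(
                [spec[r + i][c + j] for i in range(patch_h) for j in range(patch_w)]
            )
    return patches


def random_mask(num_patches, mask_ratio, rng):
    """Pick patches to hide uniformly at random; return (visible, masked) index lists."""
    num_masked = int(num_patches * mask_ratio)
    order = list(range(num_patches))
    rng.shuffle(order)
    return sorted(order[num_masked:]), sorted(order[:num_masked])


def masked_mse(pred_patches, target_patches, masked):
    """Mean squared error over masked patches only (an assumed training objective)."""
    total, count = 0.0, 0
    for idx in masked:
        for p, t in zip(pred_patches[idx], target_patches[idx]):
            total += (p - t) ** 2
            count += 1
    return total / count if count else 0.0


# Toy input: 8 frequency bins x 16 time frames of random values (hypothetical data).
rng = random.Random(0)
spec = [[rng.random() for _ in range(16)] for _ in range(8)]

patches = patchify(spec, patch_h=4, patch_w=4)          # 2 x 4 grid = 8 patches
visible, masked = random_mask(len(patches), 0.75, rng)  # mask ratio is an assumption

# In a real model only the visible patches would be fed to the encoder.
encoder_input = [patches[i] for i in visible]

# Stand-in for the decoder output: a prediction of zeros for every patch.
pred = [[0.0] * len(p) for p in patches]
loss = masked_mse(pred, patches, masked)
print(f"{len(visible)} visible, {len(masked)} masked, loss on masked patches: {loss:.4f}")
```

In this sketch only the visible patches reach the encoder, so the reconstruction target is information the model never saw directly; whether the original method scores the loss on masked patches alone is not stated in the abstract.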

Dading Chong, Helin Wang, Peilin Zhou, Qingcheng Zeng • 2022

Related benchmarks

Task                 | Dataset                   | Result         | Rank
Audio Classification | ESC-50                    | Accuracy 90.7  | 325
Audio Classification | AudioSet 20K              | mAP 34.7       | 128
Audio Classification | AudioSet 2M               | mAP 47.1       | 79
Audio Classification | SPC V2                    | Accuracy 97.7  | 65
Audio Classification | ESC50                     | Top-1 Acc 89.6 | 64
Keyword Spotting     | Speech Commands V2        | Accuracy 97.7  | 61
Classification       | AudioSet (test)           | mAP 47.1       | 57
Audio Recognition    | Speech Commands V2        | Accuracy 97.7  | 43
Sound classification | AudioSet (evaluation)     | mAP 32.3       | 39
Audio Classification | Speech Commands V2 (test) | Accuracy 97.7  | 35
Showing 10 of 20 rows
