
EAT: Self-Supervised Pre-Training with Efficient Audio Transformer

About

Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands of pre-training pose a significant barrier to the application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling of acoustic events. Furthermore, we show that the masking strategy is critical in audio SSL pre-training, and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, while delivering a pre-training speedup of up to ~15x over existing audio SSL models.
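For intuition, here is a minimal PyTorch-style sketch of the two ideas the abstract highlights: inverse block masking, which keeps contiguous rectangular blocks of spectrogram patches visible and masks everything else, and a UFO-style loss that combines frame-level regression on masked positions with utterance-level regression against a pooled teacher representation. All function names, the block size, the mask ratio, and the weighting `w` are illustrative assumptions, not EAT's exact implementation or hyperparameters.

```python
import torch

def inverse_block_mask(grid_h: int, grid_w: int,
                       mask_ratio: float = 0.8, block: int = 5) -> torch.Tensor:
    """Inverse block masking on a (grid_h x grid_w) patch grid (sketch).

    Standard block masking masks out rectangular blocks; the *inverse*
    variant instead keeps rectangular blocks visible and masks the rest,
    so the visible patches stay spatially contiguous.
    Returns a boolean grid where True = masked.
    """
    mask = torch.ones(grid_h, grid_w, dtype=torch.bool)   # start fully masked
    target_visible = int((1 - mask_ratio) * grid_h * grid_w)
    while (~mask).sum() < target_visible:
        # Reveal a random block x block window of patches.
        top = torch.randint(0, grid_h - block + 1, (1,)).item()
        left = torch.randint(0, grid_w - block + 1, (1,)).item()
        mask[top:top + block, left:left + block] = False
    return mask

def ufo_loss(student_frames: torch.Tensor, teacher_frames: torch.Tensor,
             student_cls: torch.Tensor, teacher_pooled: torch.Tensor,
             w: float = 1.0) -> torch.Tensor:
    """Utterance-Frame Objective (sketch): a frame-level regression on
    masked patch representations plus an utterance-level regression of a
    global (e.g. mean-pooled) teacher target from the student's CLS token;
    w balances the two terms.
    """
    frame_loss = (student_frames - teacher_frames).pow(2).mean()
    utterance_loss = (student_cls - teacher_pooled).pow(2).mean()
    return frame_loss + w * utterance_loss
```

As a usage example under these assumptions, `inverse_block_mask(64, 8, mask_ratio=0.8, block=4)` on a 64x8 patch grid leaves roughly 100 of 512 patches visible, grouped into contiguous blocks rather than scattered at random.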

Wenxi Chen, Yuzhe Liang, Ziyang Ma, Zhisheng Zheng, Xie Chen • 2024

Related benchmarks

Task                 | Dataset                   | Metric       | Result | Rank
Audio Classification | ESC-50                    | Accuracy     | 95.9   | 325
Audio Classification | AudioSet 20K              | mAP          | 40.28  | 128
Audio Classification | ESC-50 (test)             | Accuracy     | 96.5   | 84
Audio Classification | AudioSet 2M               | mAP          | 48.6   | 79
Audio Classification | SPC V2                    | Accuracy     | 98.3   | 65
Keyword Spotting     | Speech Commands V2        | Accuracy     | 98.3   | 61
Audio Recognition    | Speech Commands V2        | Accuracy     | 98.3   | 43
Audio Classification | US8K (test)               | R@1 Accuracy | 0.9807 | 41
Audio Classification | Speech Commands V2 (test) | Accuracy     | 98.3   | 35
Audio Event Tagging  | AudioSet AS-2M (full)     | mAP          | 48.6   | 33

(Showing 10 of 28 rows.)

Other info

Code
