# EAT: Self-Supervised Pre-Training with Efficient Audio Transformer
## About
Audio self-supervised learning (SSL) pre-training, which aims to learn good representations from unlabeled audio, has made remarkable progress. However, the extensive computational demands during pre-training pose a significant barrier to the potential application and optimization of audio SSL models. In this paper, inspired by the success of data2vec 2.0 in the image modality and Audio-MAE in the audio modality, we introduce the Efficient Audio Transformer (EAT) to further improve the effectiveness and efficiency of audio SSL. EAT adapts the bootstrap self-supervised training paradigm to the audio domain. A novel Utterance-Frame Objective (UFO) is designed to enhance the modeling capability of acoustic events. Furthermore, we reveal that the masking strategy is critical in audio SSL pre-training, and that superior audio representations can be obtained with large inverse block masks. Experimental results demonstrate that EAT achieves state-of-the-art (SOTA) performance on a range of audio-related tasks, including AudioSet (AS-2M, AS-20K), ESC-50, and SPC-2, along with a significant pre-training speedup of up to ~15x over existing audio SSL models.
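The two ideas named in the abstract, inverse block masking and the Utterance-Frame Objective, can be illustrated with a short sketch. The PyTorch snippet below is a minimal illustration only, not EAT's actual implementation: the function names, the block size, the mask ratio, and the way the utterance-level loss is weighted are all assumptions made for the example.

```python
import torch
import torch.nn.functional as F


def inverse_block_mask(grid_h, grid_w, block_size=5, mask_ratio=0.8):
    """Sample an inverse block mask over a patch grid: contiguous square
    blocks are kept VISIBLE and everything else stays masked, so the
    masked region is large and irregular. Returns a bool tensor of shape
    (grid_h * grid_w,) where True means the patch is masked.
    Hypothetical parameters; EAT's real settings may differ."""
    mask = torch.ones(grid_h, grid_w, dtype=torch.bool)  # start fully masked
    target_visible = int((1 - mask_ratio) * grid_h * grid_w)
    while (~mask).sum() < target_visible:
        # reveal one randomly placed square block per iteration
        top = torch.randint(0, grid_h - block_size + 1, (1,)).item()
        left = torch.randint(0, grid_w - block_size + 1, (1,)).item()
        mask[top:top + block_size, left:left + block_size] = False
    return mask.flatten()


def ufo_loss(student_cls, student_frames, teacher_frames, mask, utt_weight=1.0):
    """Utterance-Frame Objective for a single clip: a frame-level MSE on the
    masked patches plus an utterance-level MSE between the student's CLS
    embedding and a mean-pooled summary of the teacher's features.
    `utt_weight` balancing the two terms is an assumption for this sketch."""
    frame_loss = F.mse_loss(student_frames[mask], teacher_frames[mask])
    utterance_target = teacher_frames.mean(dim=0)  # global summary of the clip
    utterance_loss = F.mse_loss(student_cls, utterance_target)
    return frame_loss + utt_weight * utterance_loss
```

In the bootstrap setup this pairs with, `teacher_frames` would come from an EMA-updated copy of the student encoding the unmasked spectrogram, while the student sees only the visible patches; the sketch omits both networks and shows just the mask and the combined objective.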
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 95.9 | 325 |
| Audio Classification | AudioSet 20K | mAP | 40.28 | 128 |
| Audio Classification | ESC-50 (test) | Accuracy | 96.5 | 84 |
| Audio Classification | AudioSet 2M | mAP | 48.6 | 79 |
| Audio Classification | SPC V2 | Accuracy | 98.3 | 65 |
| Keyword Spotting | Speech Commands V2 | Accuracy | 98.3 | 61 |
| Audio Recognition | Speech Commands V2 | Accuracy | 98.3 | 43 |
| Audio Classification | US8K (test) | R@1 Accuracy | 0.9807 | 41 |
| Audio Classification | Speech Commands V2 (test) | Accuracy | 98.3 | 35 |
| Audio Event Tagging | AudioSet AS-2M (full) | mAP | 48.6 | 33 |