Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks
About
Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. Our self-supervised pre-training can reduce the need for labeled data by two-thirds. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset through audio-only self-supervised learning. Our fine-tuned conformers also surpass or match the performance of previous systems pre-trained in a supervised way on several downstream tasks. We further discuss the important design considerations for both pre-training and fine-tuning.
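The AudioSet results are reported as mean average precision (mAP). As a rough illustration of that metric, the sketch below computes a macro-averaged, non-interpolated average precision over classes for a multi-label task. This is an assumption about the metric's general form, not the paper's evaluation code. The function names `average_precision` and `mean_average_precision` are hypothetical, and the choice to skip classes with no positive examples is ours.

```python
from typing import List, Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AP for one class: mean of precision@k over the ranks k that hold a positive."""
    # Rank clips from highest to lowest score (ties keep their input order).
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precisions: List[float] = []
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precisions.append(hits / rank)
    if not precisions:
        raise ValueError("class has no positive examples")
    return sum(precisions) / len(precisions)


def mean_average_precision(
    label_matrix: Sequence[Sequence[int]],
    score_matrix: Sequence[Sequence[float]],
) -> float:
    """Macro-average AP over classes (columns); classes without positives are skipped."""
    num_classes = len(label_matrix[0])
    aps: List[float] = []
    for c in range(num_classes):
        col_labels = [row[c] for row in label_matrix]
        if sum(col_labels) == 0:
            continue  # assumption: ignore classes that never occur in the evaluation set
        col_scores = [row[c] for row in score_matrix]
        aps.append(average_precision(col_labels, col_scores))
    return sum(aps) / len(aps)


# Toy example: 4 clips, 2 classes (rows are clips, columns are classes).
toy_labels = [[1, 0], [0, 1], [1, 0], [0, 0]]
toy_scores = [[0.9, 0.2], [0.8, 0.9], [0.7, 0.1], [0.1, 0.3]]
toy_map = mean_average_precision(toy_labels, toy_scores)
```

On this scale a score of 0.415 corresponds to 41.5 when expressed as a percentage.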
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 88 | 325 |
| Audio Classification | AudioSet 20K | mAP | 27.6 | 128 |
| Audio Classification | ESC-50 (test) | Accuracy | 88 | 84 |
| Audio Classification | AudioSet 2M | mAP | 41.5 | 79 |
| Audio Classification | AudioSet-2M (full) | mAP | 41.1 | 32 |
| Audio Classification | AudioSet Full (test) | mAP | 41.1 | 23 |
| Classification | AudioSet AS-2M | mAP (%) | 41.1 | 21 |
| Tagging | MTT Magnatagatune (test) | MTT AUC | 91.2 | 13 |
| Audio Classification | AudioSet-20K (test) | mAP | 27.6 | 13 |
| Action Recognition | Kinetics-700 (test) | Accuracy | 23.5 | 11 |