
Conformer-Based Self-Supervised Learning for Non-Speech Audio Tasks

About

Representation learning from unlabeled data has been of major interest in artificial intelligence research. While self-supervised speech representation learning has been popular in the speech research community, very few works have comprehensively analyzed audio representation learning for non-speech audio tasks. In this paper, we propose a self-supervised audio representation learning method and apply it to a variety of downstream non-speech audio tasks. We combine the well-known wav2vec 2.0 framework, which has shown success in self-supervised learning for speech tasks, with parameter-efficient conformer architectures. Our self-supervised pre-training can reduce the need for labeled data by two-thirds. On the AudioSet benchmark, we achieve a mean average precision (mAP) score of 0.415, which is a new state-of-the-art on this dataset through audio-only self-supervised learning. Our fine-tuned conformers also surpass or match the performance of previous systems pre-trained in a supervised way on several downstream tasks. We further discuss the important design considerations for both pre-training and fine-tuning.

Sangeeta Srivastava, Yun Wang, Andros Tjandra, Anurag Kumar, Chunxi Liu, Kritika Singh, Yatharth Saraf • 2021
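The abstract describes combining the wav2vec 2.0 framework with conformer encoders: spans of the latent audio representation are masked, and the model is trained with a contrastive (InfoNCE-style) objective to pick the true quantized target for each masked position out of a set of distractors. The sketch below illustrates only that contrastive objective on plain vectors; the function names and toy inputs are illustrative assumptions, not the paper's implementation (in the actual model, `context` would be a conformer output at a masked frame and `positive`/`negatives` would be quantized latents).

```python
import math

def cosine_sim(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def contrastive_loss(context, positive, negatives, temperature=0.1):
    """InfoNCE-style loss: the model must identify the true target
    (positive) for a masked position among distractor candidates."""
    logits = [cosine_sim(context, positive) / temperature]
    logits += [cosine_sim(context, neg) / temperature for neg in negatives]
    # Numerically stable -log softmax of the positive's logit.
    m = max(logits)
    denom = sum(math.exp(l - m) for l in logits)
    return -(logits[0] - m - math.log(denom))
```

A context vector aligned with its positive target yields a loss near zero, while one aligned with a distractor is penalized heavily; minimizing this loss over many masked positions is what drives the self-supervised pre-training.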

Related benchmarks

Task                 | Dataset               | Metric   | Result | Rank
---------------------|-----------------------|----------|--------|-----
Audio Classification | ESC-50                | Accuracy | 88     | 325
Audio Classification | AudioSet 20K          | mAP      | 27.6   | 128
Audio Classification | ESC-50 (test)         | Accuracy | 88     | 84
Audio Classification | AudioSet 2M           | mAP      | 41.5   | 79
Audio Classification | AudioSet-2M (full)    | mAP      | 41.1   | 32
Audio Classification | AudioSet Full (test)  | mAP      | 41.1   | 23
Classification       | AudioSet AS-2M        | mAP (%)  | 41.1   | 21
Tagging              | Magnatagatune (test)  | AUC      | 91.2   | 13
Audio Classification | AudioSet-20K (test)   | mAP      | 27.6   | 13
Action Recognition   | Kinetics-700 (test)   | Accuracy | 23.5   | 11
(10 of 13 benchmark rows shown.)
