
ATST: Audio Representation Learning with Teacher-Student Transformer

About

Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data and then transfers that knowledge to a specific problem with a limited amount of labeled data. SSL has achieved promising results in various domains. This work addresses the problem of segment-level general audio SSL and proposes a new transformer-based teacher-student SSL model, named ATST. A transformer encoder is developed on a recently emerged teacher-student baseline scheme, which largely improves the modeling capability of pre-training. In addition, a new strategy for positive pair creation is designed to fully leverage the capability of the transformer. Extensive experiments have been conducted, and the proposed model achieves new state-of-the-art results on almost all of the downstream tasks.
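
The abstract does not spell out the training mechanics, so the following is only a generic sketch of two ingredients that teacher-student SSL schemes commonly use, not the authors' implementation. It assumes the teacher weights track the student through an exponential moving average, and that a positive pair is two random segments cut from the same clip. The names `ema_update`, `positive_pair`, `momentum`, and `segment_len` are hypothetical and chosen for illustration.

```python
import random


def ema_update(teacher, student, momentum=0.99):
    """Move each teacher weight a small step toward the matching student weight.

    Assumption: weights are plain lists of floats; a real model would hold tensors.
    """
    return [momentum * t + (1.0 - momentum) * s for t, s in zip(teacher, student)]


def positive_pair(clip, segment_len, rng):
    """Cut two random segments of equal length from the same clip.

    Assumption: this random-crop rule is a common generic choice, not
    necessarily the pair-creation strategy proposed in the paper.
    """
    if segment_len > len(clip):
        raise ValueError("segment_len must not exceed the clip length")
    max_start = len(clip) - segment_len
    start_a = rng.randint(0, max_start)  # randint includes both endpoints
    start_b = rng.randint(0, max_start)
    return clip[start_a:start_a + segment_len], clip[start_b:start_b + segment_len]


# Toy usage: a "clip" of 100 frames and a two-weight "model".
rng = random.Random(0)
clip = list(range(100))
view_a, view_b = positive_pair(clip, 20, rng)
teacher = ema_update([0.0, 1.0], [1.0, 1.0], momentum=0.9)
```

In a training loop of this kind, the student would be trained to match the teacher's representation of the other view, and `ema_update` would run after each student optimization step.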

Xian Li, Xiaofei Li • 2022

Related benchmarks

Task                               Dataset              Result          Rank
Audio Classification               ESC-50               Accuracy 94.1   325
Audio Classification               AudioSet 20K         mAP 37.4        128
Audio Classification               Urbansound8K         Accuracy 85.8   116
Musical Instrument Classification  NSynth               Accuracy 76.2   75
Audio Classification               SPC V2               Accuracy 95.1   65
Keyword Spotting                   Speech Commands V2   Accuracy 98     61
Speaker Identification             VoxCeleb1            Accuracy 94.3   58
Audio Classification               GTZAN                Accuracy 76.4   54
Speech Classification              VF                   Accuracy 97.6   47
Audio Recognition                  Speech Commands V2   Accuracy 98     43
Showing 10 of 21 rows
