ATST: Audio Representation Learning with Teacher-Student Transformer

About

Self-supervised learning (SSL) learns knowledge from a large amount of unlabeled data, and then transfers the knowledge to a specific problem with a limited number of labeled data. SSL has achieved promising results in various domains. This work addresses the problem of segment-level general audio SSL, and proposes a new transformer-based teacher-student SSL model, named ATST. A transformer encoder is developed on a recently emerged teacher-student baseline scheme, which largely improves the modeling capability of pre-training. In addition, a new strategy for positive pair creation is designed to fully leverage the capability of transformer. Extensive experiments have been conducted, and the proposed model achieves the new state-of-the-art results on almost all of the downstream tasks.

Xian Li, Xiaofei Li• 2022

Related benchmarks

Task	Dataset	Result
Audio Classification	ESC-50	Accuracy94.1	441
Audio Classification	AudioSet 20K	mAP37.4	147
Audio Classification	Urbansound8K	Accuracy85.8	126
Musical Instrument Classification	NSynth	Accuracy76.2	117
Audio Classification	SPC V2	Accuracy95.1	65
Music Genre Classification	GTZAN	Accuracy78.9	62
Keyword Spotting	Speech Commands V2	Accuracy98	61
Audio Classification	GTZAN	Accuracy76.4	59
Speaker Identification	VoxCeleb1	Accuracy94.3	58
Speech Classification	VF	Accuracy97.6	47

Showing 10 of 21 rows

Other info

Code

Follow for update

@wizwand_team Discord