Self-supervised Audio Teacher-Student Transformer for Both Clip-level and Frame-level Tasks
About
Self-supervised learning (SSL) has emerged as a popular approach for learning audio representations. One goal of audio self-supervised pre-training is to transfer knowledge to downstream audio tasks, which generally include clip-level and frame-level tasks. While frame-level tasks are important for fine-grained acoustic scene/event understanding, prior studies primarily evaluate on clip-level downstream tasks. To tackle both clip-level and frame-level tasks, this paper proposes Audio Teacher-Student Transformer (ATST), with a clip-level version (named ATST-Clip) and a frame-level version (named ATST-Frame), responsible for learning clip-level and frame-level representations, respectively. Both methods use a Transformer encoder and a teacher-student training scheme. We carefully design the view creation strategies for ATST-Clip and ATST-Frame. Specifically, ATST-Clip uses segment-wise data augmentations, and ATST-Frame integrates frame-wise data augmentations and masking. Experimental results show that our ATST-Frame model obtains state-of-the-art (SOTA) performance on most of the clip-level and frame-level downstream tasks. In particular, it outperforms other models by a large margin on the frame-level sound event detection task. In addition, the performance can be further improved by combining the two models through knowledge distillation. Our code is available online.
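The abstract does not spell out how the teacher is obtained in the teacher-student scheme. One common choice in self-supervised learning is to keep the teacher as an exponential moving average (EMA) of the student's weights; the sketch below illustrates that idea only. The function name `ema_update`, the `decay` value, and the flat list of weights are all hypothetical, chosen for illustration, and are not taken from the ATST code.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an exponential moving average.

    Assumption for illustration: weights are flat lists of floats.
    Each teacher weight moves a fraction (1 - decay) toward the
    corresponding student weight.
    """
    return [decay * t + (1.0 - decay) * s for t, s in zip(teacher, student)]


# Hypothetical toy weights: the student is what gradient descent updates,
# the teacher only follows it slowly through the EMA.
teacher_weights = [0.0, 1.0]
student_weights = [1.0, 1.0]

for _ in range(3):
    # In real training the student would be updated by an optimizer step
    # here; it is held fixed in this toy loop.
    teacher_weights = ema_update(teacher_weights, student_weights, decay=0.9)
```

With a decay close to 1 the teacher changes slowly, which is the usual motivation for this kind of update: the student is trained to match targets that are more stable than its own current outputs.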
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 94.1 | 325 |
| Audio Classification | AudioSet 20K | mAP | 40.5 | 128 |
| Audio Classification | Urbansound8K | Accuracy | 85.8 | 116 |
| Audio Classification | AudioSet 2M | mAP | 49.7 | 79 |
| Musical Instrument Classification | NSynth | Accuracy | 79.8 | 75 |
| Audio Classification | SPC V2 | Accuracy | 98.4 | 65 |
| Keyword Spotting | Speech Commands V2 | Accuracy | 98.4 | 61 |
| Environmental Sound Classification | FSD50K | mAP | 65.5 | 60 |
| Speaker Identification | VoxCeleb1 | Accuracy | 97.5 | 58 |
| Audio Classification | GTZAN | Accuracy | 82.9 | 54 |