BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation
About
Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representations from a single audio segment, without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"), an audio self-supervised learning method based on BYOL. Unlike most previous audio self-supervised learning methods, which rely on agreement between neighboring audio segments or disagreement between remote ones, BYOL-A creates contrasts in a pair of augmented audio segments derived from a single audio segment. With a combination of normalization and augmentation techniques, BYOL-A achieves state-of-the-art results in various downstream tasks. Extensive ablation studies also clarify the contribution of each component and of their combinations.
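The single-segment idea can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a toy one-dimensional segment in place of a real spectrogram, and `augment` is a hypothetical stand-in (random circular shift plus random gain) for the augmentations the abstract mentions without naming. The encoder and predictor are replaced by the identity for brevity. The loss is the normalized regression loss used by BYOL, written as 2 - 2 × cosine similarity.

```python
import math
import random

def normalize(segment, mean, std):
    """Standardize a segment with statistics assumed to be precomputed."""
    return [(x - mean) / std for x in segment]

def augment(segment, rng):
    """Hypothetical stand-in augmentation: random circular shift plus random gain."""
    shift = rng.randrange(len(segment))
    gain = rng.uniform(0.8, 1.2)
    return [gain * x for x in segment[shift:] + segment[:shift]]

def byol_loss(p, z):
    """BYOL-style loss: 2 - 2 * cosine similarity between prediction p and target z."""
    dot = sum(a * b for a, b in zip(p, z))
    norm_p = math.sqrt(sum(a * a for a in p))
    norm_z = math.sqrt(sum(b * b for b in z))
    return 2.0 - 2.0 * dot / (norm_p * norm_z)

rng = random.Random(0)
segment = [math.sin(0.1 * t) for t in range(64)]  # toy "audio" segment
x = normalize(segment, mean=0.0, std=1.0)

# Both views are derived from the same segment; no second time segment is sampled.
view_a = augment(x, rng)
view_b = augment(x, rng)

# Identity encoder/predictor (an assumption for illustration only).
loss = byol_loss(view_a, view_b)
```

In a real setup, `view_a` would pass through an online encoder and predictor, `view_b` through a slowly updated target encoder, and the loss would be symmetrized by swapping the two views.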
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Audio Classification | Urbansound8K | Accuracy | 79.1 | 116 |
| Musical Instrument Classification | NSynth | Accuracy | 74.1 | 75 |
| Audio Classification | SPC V2 | Accuracy | 92.2 | 65 |
| Environmental Sound Classification | FSD50K | mAP | 48.9 | 60 |
| Speaker Identification | VoxCeleb1 | Accuracy | 40.1 | 58 |
| Audio Classification | US8K (test) | R@1 Accuracy | 0.791 | 41 |
| Audio Representation Evaluation | HEAR (Holistic Evaluation of Audio Representations) | CREMA-D | 62.3 | 35 |
| Environmental Sound Classification | ESC | Top-1 Acc | 78.9 | 28 |
| Environmental Sound Classification | Gunshot triangulation | Top-1 Acc | 87.5 | 23 |
| Music genre and Speech vs Music classification | GTZAN | Genre Accuracy | 83.5 | 22 |