BYOL for Audio: Self-Supervised Learning for General-Purpose Audio Representation

About

Inspired by the recent progress in self-supervised learning for computer vision that generates supervision using data augmentations, we explore a new general-purpose audio representation learning approach. We propose learning general-purpose audio representation from a single audio segment without expecting relationships between different time segments of audio samples. To implement this principle, we introduce Bootstrap Your Own Latent (BYOL) for Audio (BYOL-A, pronounced "viola"), an audio self-supervised learning method based on BYOL for learning general-purpose audio representation. Unlike most previous audio self-supervised learning methods that rely on agreement of vicinity audio segments or disagreement of remote ones, BYOL-A creates contrasts in an augmented audio segment pair derived from a single audio segment. With a combination of normalization and augmentation techniques, BYOL-A achieves state-of-the-art results in various downstream tasks. Extensive ablation studies also clarified the contribution of each component and their combinations.

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Kunio Kashino • 2021
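
The abstract's core idea, two augmented views taken from one audio segment and trained with a BYOL-style objective, can be illustrated with a minimal sketch. It uses the standard BYOL regression loss (2 minus twice the cosine similarity) and an exponential moving average for the target network. Everything else is an assumption made for illustration: the function names, the `tau` and `noise_scale` values, and especially `make_views`, which is a placeholder gain-and-noise jitter and not the normalization and augmentation pipeline of BYOL-A, which this page does not describe.

```python
import math
import random


def l2_normalize(v):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def byol_loss(online_prediction, target_projection):
    """BYOL regression loss between two embeddings.

    Equals the squared distance between the unit-normalized vectors,
    i.e. 2 - 2 * cosine similarity: 0 when aligned, 4 when opposite.
    """
    p = l2_normalize(online_prediction)
    z = l2_normalize(target_projection)
    return 2.0 - 2.0 * sum(a * b for a, b in zip(p, z))


def ema_update(target_params, online_params, tau=0.99):
    """Move target parameters toward the online ones (exponential moving average)."""
    return [tau * t + (1.0 - tau) * o for t, o in zip(target_params, online_params)]


def make_views(segment, rng, noise_scale=0.05):
    """Hypothetical augmentation: two independently jittered copies of ONE segment.

    Stand-in only; not the augmentations used by BYOL-A.
    """
    def jitter(x):
        gain = rng.uniform(0.8, 1.2)
        return [gain * v + rng.gauss(0.0, noise_scale) for v in x]

    return jitter(segment), jitter(segment)


if __name__ == "__main__":
    rng = random.Random(0)
    segment = [math.sin(0.1 * i) for i in range(16)]  # toy "audio" segment
    view_a, view_b = make_views(segment, rng)
    # With a real model, the views would pass through the online and target
    # networks first; here the raw views stand in for their outputs.
    print("loss between views:", byol_loss(view_a, view_b))
```

In this sketch both views come from the same segment, so no relationship between different time segments is needed to form a training pair, which is the principle the abstract states. BYOL as originally proposed also symmetrizes the loss by swapping the two views; that step is omitted here for brevity.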

Related benchmarks

Task                                            Dataset                                              Result               Rank
Audio Classification                            Urbansound8K                                         Accuracy 79.1        116
Musical Instrument Classification               NSynth                                               Accuracy 74.1        75
Audio Classification                            SPC V2                                               Accuracy 92.2        65
Environmental Sound Classification              FSD50K                                               mAP 48.9             60
Speaker Identification                          VoxCeleb1                                            Accuracy 40.1        58
Audio Classification                            US8K (test)                                          R@1 Accuracy 0.791   41
Audio Representation Evaluation                 HEAR (Holistic Evaluation of Audio Representations)  CREMA-D 62.3         35
Environmental Sound Classification              ESC                                                  Top-1 Acc 78.9       28
Environmental Sound Classification              Gunshot triangulation                                Top-1 Acc 87.5       23
Music genre and Speech vs Music classification  GTZAN                                                Genre Accuracy 83.5  22
Showing 10 of 21 rows
