
Masked Latent Prediction and Classification for Self-Supervised Audio Representation Learning

About

Recently, self-supervised learning methods based on masked latent prediction have proven to encode input data into powerful representations. However, during training, the learned latent space can be further transformed to extract higher-level information that could be more suited for downstream classification tasks. Therefore, we propose a new method: MAsked latenT Prediction And Classification (MATPAC), which is trained with two pretext tasks solved jointly. As in previous work, the first pretext task is a masked latent prediction task, ensuring a robust input representation in the latent space. The second one is an unsupervised classification task, which utilises the latent representations of the first pretext task to match probability distributions between a teacher and a student. We validate the MATPAC method by comparing it to other state-of-the-art proposals and conducting ablation studies. MATPAC reaches state-of-the-art self-supervised learning results on reference audio classification datasets such as OpenMIC, GTZAN, ESC-50 and US8K, and outperforms comparable supervised methods' results for musical auto-tagging on Magna-tag-a-tune.
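The abstract describes two jointly optimised pretext tasks: a masked latent prediction loss on the latent representations, and an unsupervised classification loss that matches probability distributions between a teacher and a student. A minimal NumPy sketch of how such a joint objective could be combined is shown below; the loss forms, temperatures, and the weighting factor `lam` are illustrative assumptions, not the exact MATPAC formulation (see the paper for the actual objectives).

```python
import numpy as np

rng = np.random.default_rng(0)

def masked_prediction_loss(pred, target, mask):
    # Hypothetical task 1: MSE between predicted and target latents,
    # averaged over masked patches only.
    diff = ((pred - target) ** 2).mean(axis=-1)
    return float((diff * mask).sum() / mask.sum())

def softmax(logits, tau):
    # Temperature-scaled softmax over pseudo-class logits.
    z = logits / tau
    z -= z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def classification_loss(student_logits, teacher_logits, tau_s=0.1, tau_t=0.05):
    # Hypothetical task 2: cross-entropy between a sharpened teacher
    # distribution and the student distribution over pseudo-classes.
    p_t = softmax(teacher_logits, tau_t)
    p_s = softmax(student_logits, tau_s)
    return float(-(p_t * np.log(p_s + 1e-9)).sum(axis=-1).mean())

# Toy shapes: batch of 2 clips, 8 patches, 16-dim latents, 32 pseudo-classes.
pred = rng.normal(size=(2, 8, 16))
target = rng.normal(size=(2, 8, 16))
mask = rng.integers(0, 2, size=(2, 8)).astype(float)
mask[0, 0] = 1.0  # ensure at least one masked patch
student_logits = rng.normal(size=(2, 32))
teacher_logits = rng.normal(size=(2, 32))

lam = 0.5  # assumed weighting between the two pretext tasks
total = (masked_prediction_loss(pred, target, mask)
         + lam * classification_loss(student_logits, teacher_logits))
```

In practice the teacher would be an exponential moving average of the student and gradients would flow only through the student branch; this sketch only shows how the two loss terms combine into one training objective.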

Aurian Quelennec, Pierre Chouteau, Geoffroy Peeters, Slim Essid • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC-50 | Accuracy | 93.5 | 325 |
| Audio Classification | UrbanSound8K | Accuracy | 89.4 | 116 |
| Musical Instrument Classification | NSynth | Accuracy | 74.6 | 75 |
| Environmental Sound Classification | FSD50K | mAP | 55.2 | 60 |
| Audio Classification | GTZAN | Accuracy | 85.9 | 54 |
| Audio Classification | MTT | mAP | 41.1 | 11 |
| Audio Classification | OpenMIC | mAP | 85.4 | 11 |

Other info

Code
