
M2D-CLAP: Masked Modeling Duo Meets CLAP for Learning General-purpose Audio-Language Representation

About

Contrastive language-audio pre-training (CLAP) enables zero-shot (ZS) inference on audio and shows promising performance in several classification tasks. However, conventional audio representations remain crucial for many tasks where ZS is not applicable (e.g., regression problems). Here, we explore a new general-purpose audio-language representation that performs well in both ZS and transfer learning. To this end, we propose M2D-CLAP, a method that combines the self-supervised learning method Masked Modeling Duo (M2D) with CLAP. M2D learns an effective representation for modeling audio signals, and CLAP aligns that representation with text embeddings. As a result, M2D-CLAP learns a versatile representation that supports both ZS and transfer learning. Experiments show that M2D-CLAP performs well on linear evaluation, fine-tuning, and ZS classification, including a GTZAN state-of-the-art of 75.17%, thus achieving a general-purpose audio-language representation.
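The CLAP side of the method aligns audio and text embeddings with a symmetric contrastive objective. As an illustration only (the paper's exact loss, projection heads, and temperature schedule are not given here), a minimal NumPy sketch of a CLIP/CLAP-style symmetric InfoNCE loss over a batch of paired audio/text embeddings might look like this; all names and the fixed temperature value are assumptions:

```python
import numpy as np

def clap_contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE loss for paired audio/text embeddings.

    audio_emb, text_emb: (N, D) arrays; row i of each is a matched pair.
    temperature: softmax temperature (0.07 is a common default, assumed here).
    """
    # L2-normalize so dot products become cosine similarities.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = a @ t.T / temperature   # (N, N): similarity of every audio to every text
    labels = np.arange(len(a))       # matched pairs sit on the diagonal

    def cross_entropy(l):
        # Row-wise softmax cross-entropy against the diagonal targets.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(l)), labels].mean()

    # Average the audio-to-text and text-to-audio directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

Minimizing this loss pulls each audio embedding toward its paired caption embedding and pushes it away from the other captions in the batch, which is what makes zero-shot classification by text prompts possible.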

Daisuke Niizumi, Daiki Takeuchi, Yasunori Ohishi, Noboru Harada, Masahiro Yasuda, Shunsuke Tsubaki, Keisuke Imoto · 2024

Related benchmarks

| Task                              | Dataset            | Metric    | Result | Rank |
|-----------------------------------|--------------------|-----------|--------|------|
| Audio Classification              | ESC-50             | Accuracy  | 97.4   | 325  |
| Audio Classification              | AudioSet 20K       | mAP       | 41.8   | 128  |
| Audio Classification              | UrbanSound8K       | Accuracy  | 88.8   | 116  |
| Audio Classification              | AudioSet 2M        | mAP       | 48.5   | 79   |
| Musical Instrument Classification | NSynth             | Accuracy  | 78     | 75   |
| Audio Classification              | SPC V2             | Accuracy  | 98.3   | 65   |
| Audio Classification              | ESC50              | Top-1 Acc | 77.8   | 64   |
| Keyword Spotting                  | Speech Commands V2 | Accuracy  | 98.3   | 61   |
| Speaker Identification            | VoxCeleb1          | Accuracy  | 95.5   | 58   |
| Classification                    | AudioSet (test)    | mAP       | 27.24  | 57   |

(Showing 10 of 42 rows)
