
M2D-CLAP: Exploring General-purpose Audio-Language Representations Beyond CLAP

About

Contrastive language-audio pre-training (CLAP), which learns audio-language representations by aligning audio and text in a common feature space, has become popular for solving audio tasks. However, CLAP's audio features lack generalizability, whereas self-supervised learning (SSL) models offer general-purpose features that perform well across diverse audio tasks. We aim to develop a broadly applicable audio representation and hypothesize that a model that learns both general audio and CLAP features should achieve our goal, which we call a general-purpose audio-language representation. To implement our hypothesis, we propose M2D-CLAP, the first approach to jointly learn effective general audio and CLAP features. It extends an SSL masked modeling duo (M2D) by incorporating CLAP and utilizes LLM-based sentence embeddings. The training process consists of multiple stages. In the first stage, generalizable audio features are pre-trained via a multitask objective combining M2D and CLAP, with CLAP leveraging LLM-based semantic embeddings to distill semantic knowledge into them. In the following stages, CLAP features are pre-trained and refined with guidance from the learned audio features. Experiments demonstrated that M2D-CLAP learns high-performing general audio features (e.g., AudioSet mAP of 49.0, SOTA results in music tasks) and CLAP features, thereby enabling a general-purpose audio-language representation.
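The CLAP component described above aligns audio and text embeddings in a shared space. As a rough illustration only, the sketch below computes a symmetric contrastive loss of the kind commonly used for such alignment; it is not the authors' implementation. The function names, the temperature value of 0.07, and the use of plain Python lists in place of model outputs are all assumptions made for this example.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clap_contrastive_loss(audio_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    audio_embs[i] and text_embs[i] are assumed to be a matching pair.
    The temperature is an assumed value, not one taken from the paper.
    """
    audio = [l2_normalize(a) for a in audio_embs]
    text = [l2_normalize(t) for t in text_embs]
    n = len(audio)

    # logits[i][j]: scaled cosine similarity between audio i and text j.
    logits = [
        [sum(x * y for x, y in zip(a, t)) / temperature for t in text]
        for a in audio
    ]

    # Audio-to-text direction: each audio clip should pick its own caption.
    a2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n

    # Text-to-audio direction: each caption should pick its own audio clip.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2a = -sum(log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (a2t + t2a)
```

With well-aligned pairs the loss approaches zero, and with indistinguishable embeddings it equals the log of the batch size. In a multitask setting such as the one the abstract describes, a term like this would be combined with the masked-modeling loss; how the two are weighted is not stated here.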

Daisuke Niizumi, Daiki Takeuchi, Masahiro Yasuda, Binh Thien Nguyen, Yasunori Ohishi, Noboru Harada • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Audio Classification | ESC-50 | Accuracy | 98.5 | 374 |
| Text-to-Audio Retrieval | AudioCaps (test) | Recall@1 | 41.9 | 152 |
| Audio Captioning | AudioCaps (test) | CIDEr | 72.4 | 140 |
| Audio Classification | Urbansound8K | Accuracy | 89.7 | 126 |
| Musical Instrument Classification | NSynth | Accuracy | 76.7 | 106 |
| Audio Classification | ESC-50 (test) | Accuracy | 98.5 | 87 |
| Audio-to-Text Retrieval | Clotho (test) | R@1 | 24.9 | 85 |
| Audio-to-Text Retrieval | AudioCaps (test) | R@1 | 59.2 | 69 |
| Text-to-Audio Retrieval | Clotho (test) | R@1 | 20.1 | 69 |
| Audio Classification | SPC V2 | Accuracy | 98.4 | 65 |
Showing 10 of 37 rows
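
Several of the retrieval rows above report Recall@1 (R@1). As a minimal sketch of how such a score can be computed from a query-by-candidate similarity matrix, the hypothetical helper below assumes exactly one correct candidate per query, located at the same index as the query. Benchmarks that pair several captions with one clip would need a set of correct indices per query instead, and the evaluation protocol behind the numbers above is not described on this page.

```python
def recall_at_k(similarity, k=1):
    """Percentage of queries whose correct candidate ranks in the top k.

    similarity[i][j] is the score between query i and candidate j.
    Assumption for this sketch: the correct candidate for query i is index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Candidate indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if i in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)
```

For text-to-audio retrieval the rows of the matrix are captions and the columns are audio clips; for audio-to-text retrieval the roles are swapped.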
