
Adapting Language-Audio Models as Few-Shot Audio Learners

About

We presented the Treff adapter, a training-efficient adapter for CLAP that boosts zero-shot classification performance by making use of a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes using a set of audio-label pairs, and combined it with CLAP's zero-shot classification results. Furthermore, we designed a training-free version of the Treff adapter by using CALM as a cosine similarity measure. Experiments showed that the proposed Treff adapter is comparable to, and even better than, fully supervised methods and adaptation methods in both low-shot and data-abundant scenarios. While the Treff adapter shows that combining large-scale pretraining and rapid learning of domain-specific knowledge is non-trivial for obtaining generic representations for few-shot learning, it is still limited to audio classification tasks. In the future, we will explore how to use audio-language models in diverse audio domains.
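The idea of combining a zero-shot prediction with a similarity-based lookup over a few labelled clips can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `few_shot_adapt`, the `alpha` and `temperature` parameters, and the weighted-sum combination rule are all assumptions made for illustration, and the embeddings are plain NumPy arrays standing in for CLAP audio and text embeddings.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def few_shot_adapt(test_audio, class_text, support_audio, support_labels,
                   num_classes, alpha=1.0, temperature=0.1):
    """Hypothetical training-free adapter sketch.

    test_audio:     (n_test, d) audio embeddings to classify
    class_text:     (num_classes, d) text embeddings, one per class prompt
    support_audio:  (n_support, d) embeddings of the labelled few-shot clips
    support_labels: (n_support,) integer class ids for the support clips
    alpha, temperature: assumed hyperparameters, not taken from the paper
    """
    q = l2_normalize(np.asarray(test_audio, dtype=float))
    t = l2_normalize(np.asarray(class_text, dtype=float))
    k = l2_normalize(np.asarray(support_audio, dtype=float))

    # Zero-shot branch: cosine similarity between audio and class text.
    zero_shot = softmax(q @ t.T / temperature)

    # Few-shot branch: cosine similarity to the support clips acts as
    # attention weights, which then average the one-hot support labels.
    attention = softmax(q @ k.T / temperature)
    one_hot = np.eye(num_classes)[np.asarray(support_labels)]
    retrieved = attention @ one_hot

    # Assumed combination rule: weighted sum, renormalised to sum to 1.
    combined = zero_shot + alpha * retrieved
    return combined / combined.sum(axis=1, keepdims=True)
```

With `alpha=0` the function reduces to the zero-shot branch alone, and larger values of `alpha` give the labelled clips more influence over the final prediction.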

Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang • 2023

Related benchmarks

Task | Dataset | Metric | Result | Rank
Audio Classification | ESC50 (test) | R@1 Accuracy | 0.9448 | 28
Urban Sound Classification | UrbanSound8K (test) | Accuracy | 78.1 | 28
Audio Classification | CREMA-D (test) | Accuracy | 20.46 | 9
Audio Classification | ESC50 Actions (test) | Accuracy | 97.75 | 7
Audio Classification | RAVDESS (test) | Accuracy | 0.3523 | 7
Audio Classification | VocalSound (test) | Accuracy | 73.94 | 7
Audio Classification | Beijing-Opera (test) | Accuracy | 0.8964 | 7
Audio Classification | GT-Music-Genre (test) | Accuracy | 61.17 | 7
Audio Classification | NS-Instruments (test) | Accuracy | 49.89 | 7
Audio Classification | TUT 2017 (test) | Accuracy | 50.47 | 7
Showing 10 of 11 rows
