Adapting Language-Audio Models as Few-Shot Audio Learners

About

We presented the Treff adapter, a training-efficient adapter for CLAP, to boost zero-shot classification performance by making use of a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes using a set of audio-label pairs and combined it with CLAP's zero-shot classification results. Furthermore, we designed a training-free version of the Treff adapter by using CALM as a cosine similarity measure. Experiments showed that the proposed Treff adapter is comparable to, and in some cases better than, fully-supervised methods and adaptation methods in both low-shot and data-abundant scenarios. While the Treff adapter shows that combining large-scale pretraining and rapid learning of domain-specific knowledge is non-trivial for obtaining generic representations for few-shot learning, it is still limited to audio classification tasks. In the future, we will explore how to use audio-language models in diverse audio domains.
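The training-free idea described above (cosine similarity against a small labelled support set, combined with zero-shot scores) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `training_free_few_shot`, the mixing weight `alpha`, the `temperature`, and the additive way of combining the two distributions are all assumptions made for illustration, and the embeddings are assumed to come from some pretrained audio and text encoders that are not shown here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def training_free_few_shot(query_emb, support_emb, support_labels, text_emb,
                           num_classes, alpha=1.0, temperature=0.1):
    """Combine zero-shot and few-shot class probabilities (illustrative only).

    query_emb:      (Q, D) audio embeddings of clips to classify
    support_emb:    (N, D) audio embeddings of the labelled few-shot clips
    support_labels: (N,)   integer class ids of the support clips
    text_emb:       (C, D) text embeddings of the class prompts
    alpha, temperature: hypothetical hyperparameters, not from the source
    """
    q = l2_normalize(query_emb)
    s = l2_normalize(support_emb)
    t = l2_normalize(text_emb)

    # Zero-shot branch: cosine similarity between audio and class-text embeddings.
    zero_shot = softmax(q @ t.T / temperature)              # (Q, C)

    # Few-shot branch: cosine similarity to each support clip acts as attention
    # weights, which are then spread over classes via the one-hot labels.
    attention = softmax(q @ s.T / temperature)              # (Q, N)
    one_hot = np.eye(num_classes)[support_labels]           # (N, C)
    few_shot = attention @ one_hot                          # (Q, C)

    # Assumed combination rule: weighted sum, renormalised to a distribution.
    combined = zero_shot + alpha * few_shot
    return combined / combined.sum(axis=1, keepdims=True)
```

Because both branches produce a distribution over the same classes, the support set can correct the zero-shot prediction where the text prompts are ambiguous, without updating any weights. A trainable variant would replace the fixed similarity with learned projections.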

Jinhua Liang, Xubo Liu, Haohe Liu, Huy Phan, Emmanouil Benetos, Mark D. Plumbley, Wenwu Wang • 2023

Related benchmarks

Task	Dataset	Metric	Value	
Audio Classification	ESC50 (test)	R@1 Accuracy	0.9448	28
Urban Sound Classification	UrbanSound8K (test)	Accuracy	78.1	28
Audio Classification	VocalSound (test)	Accuracy	73.94	15
Audio Classification	CREMA-D (test)	Accuracy	20.46	9
Audio Classification	ESC50 Actions (test)	Accuracy	97.75	7
Audio Classification	RAVDESS (test)	Accuracy	0.3523	7
Audio Classification	Beijing-Opera (test)	Accuracy	0.8964	7
Audio Classification	GT-Music-Genre (test)	Accuracy	61.17	7
Audio Classification	NS-Instruments (test)	Accuracy	49.89	7
Audio Classification	TUT 2017 (test)	Accuracy	50.47	7

