Adapting Language-Audio Models as Few-Shot Audio Learners
About
We presented the Treff adapter, a training-efficient adapter for CLAP that boosts zero-shot classification performance using a small set of labelled data. Specifically, we designed CALM to retrieve the probability distribution of text-audio clips over classes from a set of audio-label pairs, and combined it with CLAP's zero-shot classification results. Furthermore, we designed a training-free version of the Treff adapter by using CALM as a cosine similarity measure. Experiments showed that the proposed Treff adapter is comparable to, and even better than, fully supervised methods and adaptation methods in both low-shot and data-abundant scenarios. While the Treff adapter shows that combining large-scale pretraining and rapid learning of domain-specific knowledge is non-trivial for obtaining generic representations for few-shot learning, it is still limited to audio classification tasks. In the future, we will explore how to use audio-language models in diverse audio domains.
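The training-free variant described above can be illustrated with a minimal sketch. It is not the paper's CALM formulation: it assumes a generic cache-style scheme in which the cosine similarity between a query clip and each labelled support clip weights that clip's one-hot label, and the result is added to the zero-shot logits. The function name `training_free_few_shot` and the parameters `alpha`, `beta`, and `scale` are hypothetical, and the embeddings are assumed to come from CLAP's audio and text encoders, which are not called here.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products equal cosine similarity."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def training_free_few_shot(query_emb, text_emb, support_emb, support_labels,
                           num_classes, alpha=1.0, beta=5.0, scale=100.0):
    """Combine zero-shot logits with a similarity-weighted vote of support labels.

    query_emb:      (Q, D) audio embeddings of the clips to classify
    text_emb:       (C, D) text embeddings of the class prompts
    support_emb:    (N, D) audio embeddings of the few labelled clips
    support_labels: (N,)   integer class ids of the support clips
    alpha, beta, scale are assumed hyperparameters, not values from the paper.
    """
    q = l2_normalize(query_emb)
    t = l2_normalize(text_emb)
    s = l2_normalize(support_emb)

    # Zero-shot branch: cosine similarity between audio and class-prompt embeddings.
    zero_shot_logits = scale * (q @ t.T)                 # (Q, C)

    # Few-shot branch: cosine similarity to each support clip, sharpened by beta,
    # then aggregated per class through the one-hot support labels.
    affinity = np.exp(-beta * (1.0 - q @ s.T))           # (Q, N)
    one_hot = np.eye(num_classes)[support_labels]        # (N, C)
    few_shot_logits = affinity @ one_hot                 # (Q, C)

    # Combine both branches and convert to a distribution over classes.
    return softmax(zero_shot_logits + alpha * few_shot_logits)
```

With real data, `query_emb`, `support_emb`, and `text_emb` would be replaced by encoder outputs, and `alpha` and `beta` would be tuned on a validation split. The few-shot branch matters most when the zero-shot logits are ambiguous between classes.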
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC50 (test) | R@1 Accuracy | 0.9448 | 28 |
| Urban Sound Classification | UrbanSound8K (test) | Accuracy | 78.1 | 28 |
| Audio Classification | CREMA-D (test) | Accuracy | 20.46 | 9 |
| Audio Classification | ESC50 Actions (test) | Accuracy | 97.75 | 7 |
| Audio Classification | RAVDESS (test) | Accuracy | 0.3523 | 7 |
| Audio Classification | VocalSound (test) | Accuracy | 73.94 | 7 |
| Audio Classification | Beijing-Opera (test) | Accuracy | 0.8964 | 7 |
| Audio Classification | GT-Music-Genre (test) | Accuracy | 61.17 | 7 |
| Audio Classification | NS-Instruments (test) | Accuracy | 49.89 | 7 |
| Audio Classification | TUT 2017 (test) | Accuracy | 50.47 | 7 |