
PALM: Few-Shot Prompt Learning for Audio Language Models

About

Audio-Language Models (ALMs), inspired by advancements in Vision-Language Models (VLMs), have recently achieved remarkable success in zero-shot audio recognition tasks, which match features of audio waveforms with class-specific text prompt features. Because zero-shot performance is sensitive to the choice of hand-crafted text prompts, many prompt learning techniques have been developed for VLMs. We explore the efficacy of these approaches in ALMs and propose a novel method, Prompt Learning in Audio Language Models (PALM), which optimizes the feature space of the text encoder branch. Unlike existing methods that operate in the input space, our approach yields greater training efficiency. We demonstrate the effectiveness of our approach on 11 audio recognition datasets, encompassing a variety of speech-processing tasks, and compare the results with three baselines in a few-shot learning setup. Our method is either on par with or outperforms other approaches while being computationally less demanding. Code is available at https://asif-hanif.github.io/palm/
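The core idea described above can be illustrated with a toy sketch. In zero-shot ALM classification, an audio embedding is matched against text prompt embeddings by similarity; PALM instead learns parameters in the text encoder's *output* (feature) space, so gradients never need to flow through the text encoder. The snippet below is a minimal, hypothetical numpy sketch of that idea, not the authors' implementation: random vectors stand in for the frozen audio/text encoder outputs, and a learnable residual `delta` on the text features is fit with plain gradient descent on a cross-entropy loss for one labeled few-shot example.

```python
import numpy as np

rng = np.random.default_rng(0)
D, C = 16, 4  # embedding dimension, number of classes

# Stand-ins for frozen encoder outputs (hypothetical, random):
audio_feat = rng.normal(size=D)
audio_feat /= np.linalg.norm(audio_feat)
text_feats = rng.normal(size=(C, D))
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

# PALM-style idea: learn a residual in the TEXT FEATURE space,
# instead of learning prompt tokens in the text-encoder INPUT space.
delta = np.zeros((C, D))
label = 2          # ground-truth class of this few-shot example
lr = 0.5

for _ in range(100):
    logits = (text_feats + delta) @ audio_feat           # class scores
    probs = softmax(logits)
    grad_logits = probs.copy()
    grad_logits[label] -= 1.0                            # dCE/dlogits
    delta -= lr * np.outer(grad_logits, audio_feat)      # dlogits/ddelta = audio_feat

pred = int(np.argmax((text_feats + delta) @ audio_feat))
print(pred)
```

Because only `delta` is trained and the (stand-in) encoders stay frozen, each step is a cheap outer product; this is the source of the training-efficiency advantage the abstract claims over input-space prompt learning, where every update would require backpropagating through the full text encoder.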

Asif Hanif, Maha Tufail Agro, Mohammad Areeb Qazi, Hanan Aldarmaki • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Audio Classification | ESC50 (test) | R@1 Accuracy | 0.9593 | 28 |
| Urban Sound Classification | UrbanSound8K (test) | Accuracy | 80.77 | 28 |
| Audio Classification | CREMA-D (test) | Accuracy | 34.59 | 9 |
| Audio Classification | RAVDESS (test) | Accuracy | 0.4596 | 7 |
| Audio Classification | GT-Music-Genre (test) | Accuracy | 80 | 7 |
| Audio Classification | TUT 2017 (test) | Accuracy | 79.12 | 7 |
| Audio Classification | VocalSound (test) | Accuracy | 80.78 | 7 |
| Audio Classification | NS-Instruments (test) | Accuracy | 63.83 | 7 |
| Audio Classification | SESA (test) | Accuracy | 89.52 | 7 |
| Audio Classification | ESC50 Actions (test) | Accuracy | 96.58 | 7 |

Showing 10 of 11 rows.
