Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion
About
Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and has recently been adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines without adding computational cost at inference. Code is available at https://github.com/jhyukjang/SEPT.
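To make the idea of a margin-constrained semantic expansion loss concrete, the following is a minimal NumPy sketch, not the authors' formulation. It assumes each class has one prompt embedding and K embeddings of LLM-generated semantic neighbors, and it uses two hinge terms on cosine similarity: one pulls a prompt toward its own class's neighbors (compactness), the other pushes it away from other classes' neighbors (separability). The function name, the margin values `margin_pos` and `margin_neg`, and the equal weighting of the two terms are all illustrative assumptions; the repository linked above has the actual loss.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def semantic_expansion_loss(prompts, neighbors, margin_pos=0.8, margin_neg=0.3):
    """Hypothetical margin-based loss over prompts and semantic neighbors.

    prompts:   (C, D) array, one prompt embedding per class.
    neighbors: (C, K, D) array, K neighbor embeddings per class.
    The margins are assumed values, not taken from the paper.
    """
    num_classes = prompts.shape[0]
    if num_classes < 2:
        raise ValueError("at least two classes are needed for the inter-class term")

    p = l2_normalize(prompts)
    n = l2_normalize(neighbors)

    # sim[i, j, k] = cosine similarity of prompt i and neighbor k of class j.
    sim = np.einsum("id,jkd->ijk", p, n)

    # Boolean mask marking (prompt, neighbor) pairs from the same class.
    same = np.broadcast_to(np.eye(num_classes, dtype=bool)[:, :, None], sim.shape)

    # Intra-class compactness: penalize own-class similarity below margin_pos.
    intra = np.maximum(0.0, margin_pos - sim[same]).mean()
    # Inter-class separability: penalize other-class similarity above margin_neg.
    inter = np.maximum(0.0, sim[~same] - margin_neg).mean()
    return float(intra + inter)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prompts = rng.normal(size=(4, 16))           # 4 classes, 16-dim embeddings
    neighbors = rng.normal(size=(4, 5, 16))      # 5 neighbors per class
    print(semantic_expansion_loss(prompts, neighbors))
```

In a setup like this the term would be added to the usual classification loss during prompt tuning only, which is consistent with the claim that inference cost does not change: the neighbor embeddings are not needed once the prompts are learned.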
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | Caltech101 Base and New Classes | Base Accuracy | 98.23 | 50 |
| Audio Classification | SESA | Accuracy | 88.73 | 19 |
| Audio Classification | Beijing Opera | Base Accuracy | 97.88 | 13 |
| Audio Classification | NS-Instruments | Base Accuracy | 52.15 | 13 |
| Audio Classification | RAVDESS | Base Accuracy | 63.55 | 13 |
| Audio Classification | TUT 2017 | Base Accuracy | 51.6 | 13 |
| Audio Classification | ESC50 Actions | Accuracy (Base) | 87.5 | 13 |
| Audio Classification | VocalSound | Base Accuracy | 79.57 | 13 |
| Audio Classification | Average over 11 datasets | Base Accuracy | 68.63 | 13 |
| Audio Classification | GT-Music-Genre | Base Accuracy | 56.11 | 13 |