Generalizable Prompt Tuning for Audio-Language Models via Semantic Expansion

About

Prompt tuning has achieved remarkable progress in vision-language models (VLMs) and has recently been adopted for audio-language models (ALMs). However, its generalization ability in ALMs remains largely underexplored. We observe that conventional prompt tuning for ALMs also suffers from the Base-New Tradeoff, and we identify that this issue stems from the disrupted semantic structure of the embedding space. To address this issue, we propose Semantically Expanded Prompt Tuning (SEPT), a plug-and-play framework that explicitly regularizes the prompt embedding space by incorporating semantic neighbors generated by large language models. SEPT introduces a novel semantic expansion loss with margin constraints that promote intra-class compactness and inter-class separability, thereby enhancing the semantic structure of the prompt embedding space. For comprehensive evaluation, we establish the first benchmark setup for prompt generalization in ALMs, covering both base-to-new generalization and cross-dataset transferability. Extensive experiments demonstrate that SEPT consistently improves generalization performance across multiple prompt tuning baselines while keeping the computational cost at inference unchanged. Code is available at https://github.com/jhyukjang/SEPT.
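The abstract does not give the form of the semantic expansion loss, so the following is only a minimal sketch of the general idea of a margin-based objective over prompt embeddings and their semantic neighbors. It is not the paper's formulation. The function names, the hinge form, and the margin values `pos_margin` and `neg_margin` are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def semantic_margin_loss(prompts, neighbors, pos_margin=0.8, neg_margin=0.3):
    """Hypothetical hinge-style loss (an illustration, not SEPT's actual loss).

    prompts:   {class_name: prompt embedding}
    neighbors: {class_name: [embeddings of that class's semantic neighbors]}
    """
    terms = []
    for cls, prompt in prompts.items():
        for other, embeddings in neighbors.items():
            for emb in embeddings:
                sim = cosine(prompt, emb)
                if other == cls:
                    # Intra-class compactness: penalize similarity below pos_margin.
                    terms.append(max(0.0, pos_margin - sim))
                else:
                    # Inter-class separability: penalize similarity above neg_margin.
                    terms.append(max(0.0, sim - neg_margin))
    return sum(terms) / len(terms)


# Toy 2-D embeddings; real prompt and neighbor embeddings would come from a text encoder.
neighbors = {"dog bark": [[1.0, 0.0]], "rain": [[0.0, 1.0]]}
separated = {"dog bark": [1.0, 0.0], "rain": [0.0, 1.0]}
collapsed = {"dog bark": [1.0, 0.0], "rain": [1.0, 0.0]}

loss_separated = semantic_margin_loss(separated, neighbors)
loss_collapsed = semantic_margin_loss(collapsed, neighbors)
```

In this toy setup, the well-separated prompts incur no penalty, while prompts that collapse onto the same direction are penalized by both terms. This mirrors the compactness and separability goals described above.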

Jaehyuk Jang, Wonjun Lee, Kangwook Ko, Changick Kim • 2026

Related benchmarks

Task                  Dataset                          Result                 Rank
Image Classification  Caltech101 Base and New Classes  Base Accuracy 98.23    50
Audio Classification  SESA                             Accuracy 88.73         19
Audio Classification  Beijing Opera                    Base Accuracy 97.88    13
Audio Classification  NS-Instruments                   Base Accuracy 52.15    13
Audio Classification  RAVDESS                          Base Accuracy 63.55    13
Audio Classification  TUT 2017                         Base Accuracy 51.6     13
Audio Classification  ESC50 Actions                    Accuracy (Base) 87.5   13
Audio Classification  VocalSound                       Base Accuracy 79.57    13
Audio Classification  Average over 11 datasets         Base Accuracy 68.63    13
Audio Classification  GT-Music-Genre                   Base Accuracy 56.11    13
Showing 10 of 17 rows
