
VaMP: Variational Multi-Modal Prompt Learning for Vision-Language Models

About

Vision-language models (VLMs), such as CLIP, have shown strong generalization under zero-shot settings, yet adapting them to downstream tasks with limited supervision remains a significant challenge. Existing multi-modal prompt learning methods typically rely on fixed, shared prompts and deterministic parameters, which limits their ability to capture instance-level variation or model uncertainty across diverse tasks and domains. To tackle this issue, we propose a novel Variational Multi-Modal Prompt Learning (VaMP) framework that enables sample-specific, uncertainty-aware prompt tuning in multi-modal representation learning. VaMP generates instance-conditioned prompts by sampling from a learned posterior distribution, allowing the model to personalize its behavior based on input content. To further enhance the integration of local and global semantics, we introduce a class-aware prior derived from the instance representation and class prototype. Building upon these, we formulate prompt tuning as variational inference over latent prompt representations and train the entire framework end-to-end through reparameterized sampling. Experiments on few-shot and domain generalization benchmarks show that VaMP achieves state-of-the-art performance, highlighting the benefits of modeling both uncertainty and task structure in our method. Project page: https://visual-ai.github.io/vamp
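The abstract's two variational ingredients, reparameterized sampling of a latent prompt and a KL term that pulls the posterior toward a prior, can be sketched as follows. This is a minimal illustration and not the authors' implementation. It assumes diagonal Gaussian distributions, which the abstract does not specify. The posterior and prior parameters are hard-coded placeholder lists, whereas in VaMP they would come from learned networks conditioned on the instance and the class prototype. The function names are hypothetical.

```python
import math
import random


def reparameterized_sample(mu, log_var, rng):
    """Draw z = mu + sigma * eps elementwise, with eps ~ N(0, 1).

    Writing the sample this way keeps z a deterministic function of
    (mu, log_var) given eps, which is what lets gradients flow through
    the sampling step in an autodiff framework.
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p):
    """KL(q || p) between two diagonal Gaussians, summed over dimensions."""
    total = 0.0
    for mq, lq, mp, lp in zip(mu_q, log_var_q, mu_p, log_var_p):
        total += 0.5 * (lp - lq + (math.exp(lq) + (mq - mp) ** 2) / math.exp(lp) - 1.0)
    return total


# Placeholder parameters (assumed values, for illustration only).
rng = random.Random(0)
mu_q, log_var_q = [0.2, -0.1, 0.4], [-1.0, -1.0, -1.0]  # stand-in for an instance-conditioned posterior
mu_p, log_var_p = [0.0, 0.0, 0.0], [0.0, 0.0, 0.0]      # stand-in for a class-aware prior

z = reparameterized_sample(mu_q, log_var_q, rng)         # one latent prompt sample
kl = kl_diag_gaussians(mu_q, log_var_q, mu_p, log_var_p)  # regularizer in an ELBO-style loss
```

In a training loop, a term like `kl` would typically be added to the task loss computed from prompts built on `z`. The weighting between the two terms is a design choice the abstract does not state.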

Silin Cheng, Kai Han • 2025

Related benchmarks

Task                  Dataset       Result           Rank
Image Classification  EuroSAT       Accuracy 53.82   497
Image Classification  Flowers102    --               478
Image Classification  ImageNet      --               429
Image Classification  DTD           Accuracy 46.82   419
Image Classification  UCF101        Top-1 Acc 68.93  404
Image Classification  Food101       Accuracy 86.97   309
Image Classification  StanfordCars  Accuracy 66.1    266
Image Classification  SUN397        Accuracy 68.04   246
Image Classification  FGVCAircraft  Accuracy 26.76   225
Image Classification  Caltech101    Accuracy 94.96   162
Showing 10 of 32 rows
