Local-Global Prompt Learning via Sparse Optimal Transport

About

Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: https://github.com/Deniz2304988/SOT-GLP
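The balanced entropic optimal transport step described above can be illustrated with a small Sinkhorn iteration. This is a minimal, dependency-free sketch and not the authors' implementation. The cost (one minus cosine similarity), the uniform marginals over patches and prompts, the regularization strength `eps`, and the names `cosine` and `sinkhorn_plan` are all assumptions made for illustration.

```python
import math


def cosine(p, q):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(x * y for x, y in zip(p, q))
    norm_p = math.sqrt(sum(x * x for x in p))
    norm_q = math.sqrt(sum(y * y for y in q))
    return dot / (norm_p * norm_q)


def sinkhorn_plan(cost, eps=0.1, n_iters=200):
    """Balanced entropic OT between N patches (rows) and M prompts (columns).

    Assumed marginals: every patch carries mass 1/N and every prompt
    receives mass 1/M, so no prompt can absorb all patches.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # patch marginal (assumption: uniform)
    b = [1.0 / m] * m  # prompt marginal (assumption: uniform)
    # Gibbs kernel K = exp(-C / eps)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    # Transport plan T = diag(u) K diag(v)
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy example: 4 patch features and 2 local prompt features (hypothetical values)
patches = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
prompts = [[1.0, 0.0], [0.0, 1.0]]
cost = [[1.0 - cosine(p, t) for t in prompts] for p in patches]
plan = sinkhorn_plan(cost)
```

Each row of `plan` states how one patch's mass is split across prompts, which gives a soft partition. The column constraint forces every prompt to receive an equal share, which is the property the abstract credits with preventing overlap and collapse. For much smaller `eps`, a log-domain variant would be the numerically safer choice.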

Deniz Kizaroğlu, Ülku Tuncer Küçüktas, Emre Çakmakyurdu, Alptekin Temizel • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | Stanford Cars | Accuracy 89.2 | 660 |
| Image Classification | Flowers102 | Accuracy 99.2 | 558 |
| Image Classification | SUN397 | Accuracy 78.2 | 450 |
| Image Classification | OxfordPets | Accuracy 94.8 | 298 |
| Image Classification | EuroSAT | Accuracy 91.7 | 226 |
| Image Classification | FGVC Aircraft | Accuracy 57.6 | 223 |
| Texture Classification | DTD | Accuracy 77.1 | 131 |
| Action Recognition | UCF101 | Accuracy 87.5 | 11 |