Local-Global Prompt Learning via Sparse Optimal Transport

About

Few-shot adaptation of vision-language models (VLMs) like CLIP typically relies on learning textual prompts matched to global image embeddings. Recent works extend this paradigm by incorporating local image-text alignment to capture fine-grained visual cues, yet these approaches often select local regions independently for each prompt, leading to redundant local feature usage and prompt overlap. We propose SOT-GLP, which introduces a shared sparse patch support and balanced optimal transport allocation to explicitly partition salient visual regions among class-specific local prompts while preserving global alignment. Our method learns shared global prompts and class-specific local prompts. The global branch maintains standard image-text matching for robust category-level alignment. The local branch constructs a class-conditioned sparse patch set using V-V attention and aligns it to multiple class-specific prompts via balanced entropic optimal transport, yielding a soft partition of patches that prevents prompt overlap and collapse. We evaluate our method on two complementary objectives: (i) few-shot classification accuracy on 11 standard benchmarks and (ii) out-of-distribution (OOD) detection. On the standard 11-dataset benchmark with 16-shot ViT-B/16, SOT-GLP achieves 85.1% average accuracy, outperforming prior prompt-learning methods. We identify a distinct accuracy-robustness trade-off in prompt learning: while learnable projections optimize in-distribution fit, they alter the foundational feature space. We demonstrate that a projection-free local alignment preserves the native geometry of the CLIP manifold, yielding state-of-the-art OOD detection performance (94.2% AUC) that surpasses fully adapted models. Implementation available at: https://github.com/Deniz2304988/SOT-GLP
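The local branch described above (a shared sparse patch set, then a balanced entropic optimal transport allocation across local prompts) can be illustrated with a small toy sketch. This is not the authors' implementation: the function names `top_k_patches` and `sinkhorn`, the saliency scores standing in for V-V attention, the cosine-distance cost, the uniform marginals, and the values of `eps` and `n_iters` are all assumptions chosen for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def top_k_patches(patches, saliency, k):
    """Keep the k patches with the highest saliency, preserving patch order.

    The saliency scores are a hypothetical stand-in for whatever the
    V-V attention step produces in the actual method.
    """
    keep = sorted(range(len(patches)), key=lambda i: saliency[i], reverse=True)[:k]
    return [patches[i] for i in sorted(keep)]


def sinkhorn(cost, eps=0.1, n_iters=200):
    """Balanced entropic OT via Sinkhorn iterations with uniform marginals.

    Returns a transport plan whose rows (patches) each carry mass 1/n and
    whose columns (prompts) each receive mass 1/m.
    """
    n, m = len(cost), len(cost[0])
    a = [1.0 / n] * n  # mass per patch (assumed uniform)
    b = [1.0 / m] * m  # mass per prompt (assumed uniform)
    kernel = [[math.exp(-c / eps) for c in row] for row in cost]
    u = [1.0] * n
    v = [1.0] * m
    for _ in range(n_iters):
        # Alternately rescale rows and columns to match the marginals.
        u = [a[i] / sum(kernel[i][j] * v[j] for j in range(m)) for i in range(n)]
        v = [b[j] / sum(kernel[i][j] * u[i] for i in range(n)) for j in range(m)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(m)] for i in range(n)]


# Toy 2-D "embeddings": five patches, two local prompts for one class.
patches = [[1.0, 0.1], [0.9, 0.2], [0.5, 0.5], [0.1, 1.0], [0.2, 0.9]]
saliency = [0.9, 0.8, 0.1, 0.95, 0.7]  # hypothetical attention scores
prompts = [[1.0, 0.0], [0.0, 1.0]]

support = top_k_patches(patches, saliency, k=4)  # drops the low-saliency patch
cost = [[1.0 - cosine(p, t) for t in prompts] for p in support]
plan = sinkhorn(cost)
```

Each row of `plan` shows how one selected patch splits its mass across the prompts. Because every prompt must receive the same total mass, no single prompt can absorb all salient patches, which is the intuition behind the soft partition that discourages prompt overlap and collapse.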

Deniz Kizaroğlu, Ülku Tuncer Küçüktas, Emre Çakmakyurdu, Alptekin Temizel • 2026

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy 89.2	705
Image Classification	Flowers102	Accuracy 99.2	558
Image Classification	SUN397	Accuracy 78.2	450
Image Classification	OxfordPets	Accuracy 94.8	298
Image Classification	EuroSAT	Accuracy 91.7	226
Image Classification	FGVC Aircraft	Accuracy 57.6	223
Texture Classification	DTD	Accuracy 77.1	131
Action Recognition	UCF101	Accuracy 87.5	11

