Gram-Anchored Prompt Learning for Vision-Language Models via Second-Order Statistics
About
Parameter-efficient prompt learning has become the de facto standard for adapting Vision-Language Models (VLMs) to downstream tasks. Existing approaches predominantly focus on aligning text prompts with first-order visual features (i.e., spatial feature maps). While this is effective for fine-grained semantic discrimination, we argue that relying solely on first-order information is insufficient for robust adaptation, as these spatially entangled features are highly susceptible to domain shifts and local noise. In this work, we propose Gram-Anchored Prompt Learning (GAPL), a framework that combines local semantic alignment with global structural consistency. Methodologically, we introduce an additional second-order statistical stream, built on Gram matrices, that augments the standard first-order spatial interaction. By anchoring prompts to these second-order priors, our approach enables language representations to adapt dynamically to statistical distribution shifts across diverse domains. Extensive experiments indicate the effectiveness of the second-order features and show strong performance of GAPL on various benchmarks.
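To make the idea of a second-order statistic concrete, the following is a minimal sketch of a channel-wise Gram matrix computed from a toy feature map. It is a generic illustration only: the abstract does not specify how GAPL normalizes the Gram matrix, which layer it is taken from, or how prompts are anchored to it. The function name `gram_matrix`, the division by the number of positions, and the toy values are all assumptions made for this example.

```python
import random


def gram_matrix(tokens):
    """Channel-by-channel Gram matrix of N feature vectors of length C.

    G[i][j] = (1/N) * sum_n tokens[n][i] * tokens[n][j]
    Dividing by N is an assumption made here, not a detail from the paper.
    """
    n = len(tokens)
    c = len(tokens[0])
    gram = [[0.0] * c for _ in range(c)]
    for vec in tokens:
        for i in range(c):
            for j in range(c):
                gram[i][j] += vec[i] * vec[j] / n
    return gram


# Toy "feature map": 4 spatial positions, 3 channels (hypothetical values).
feature_map = [
    [1.0, 0.0, 2.0],
    [0.0, 1.0, 1.0],
    [2.0, 1.0, 0.0],
    [1.0, 1.0, 1.0],
]

gram = gram_matrix(feature_map)

# Reordering spatial positions changes the first-order feature map,
# but the Gram matrix only sums over positions, so it stays the same.
shuffled = feature_map[:]
random.Random(0).shuffle(shuffled)
gram_shuffled = gram_matrix(shuffled)

for row in gram:
    print(row)
```

Because the Gram matrix sums over spatial positions, it captures channel co-activation statistics while discarding where each activation occurred, which is one way to read the abstract's contrast between spatially entangled first-order features and global second-order structure.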
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | Food101 | -- | -- | 457 |
| Image Classification | Average 11 datasets | Base Accuracy | 85.78 | 83 |
| Fine-grained Image Classification | FGVC Aircraft | Accuracy (All) | 47.8 | 50 |
| Satellite Image Classification | EuroSAT | Base Score | 95.6 | 34 |
| Image Classification | ImageNet OOD Variants (-V2, -Sketch, -A, -R) | Acc (V2) | 66.43 | 24 |
| Texture Classification | DTD | -- | -- | 24 |
| Fine-grained Image Classification | Stanford Cars | Base Accuracy | 84.7 | 20 |
| Fine-grained Image Classification | Flowers-102 | Base Accuracy | 98.63 | 20 |
| Fine-grained Image Classification | Oxford Pets | Base Score | 95.73 | 20 |
| Image Classification | Caltech101 | Base Accuracy | 98.9 | 11 |