SpecPL: Disentangling Spectral Granularity for Prompt Learning
About
Existing prompt learning for vision-language models (VLMs) exhibits a modality asymmetry: it predominantly optimizes text tokens while still relying on a frozen visual encoder as a holistic extractor, neglecting the spectral granularity essential for fine-grained discrimination. To bridge this gap, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. SpecPL also serves as a universal plug-and-play booster, revitalizing text-oriented baselines such as CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, reaching a new high of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision improves the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.
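The idea of permuting high-frequency signals across samples can be illustrated with a toy sketch. This is not the SpecPL implementation: the abstract describes a frozen VAE for the decomposition, whereas the sketch below swaps in a simple moving-average low-pass filter on 1-D signals, and every name in it (`low_pass`, `split_bands`, `counterfactual_batch`, `radius`, `seed`) is hypothetical.

```python
import random


def low_pass(signal, radius=1):
    """Moving-average low-pass filter; windows are truncated at the edges."""
    n = len(signal)
    out = []
    for i in range(n):
        window = signal[max(0, i - radius): min(n, i + radius + 1)]
        out.append(sum(window) / len(window))
    return out


def split_bands(signal, radius=1):
    """Split a signal into a smooth (low) part and a residual (high) part.

    Assumption: a moving average stands in for the frozen-VAE decomposition.
    By construction, low + high reconstructs the input.
    """
    low = low_pass(signal, radius)
    high = [x - l for x, l in zip(signal, low)]
    return low, high


def counterfactual_batch(batch, seed=0, radius=1):
    """Keep each sample's low band, but attach another sample's high band.

    Returns the mixed signals and the permutation used, so a training loop
    could tell which sample each high-frequency residual came from.
    A random shuffle may leave some samples paired with themselves.
    """
    bands = [split_bands(s, radius) for s in batch]
    perm = list(range(len(batch)))
    random.Random(seed).shuffle(perm)
    mixed = [
        [l + h for l, h in zip(bands[i][0], bands[j][1])]
        for i, j in enumerate(perm)
    ]
    return mixed, perm
```

In this toy setting, each mixed signal shares its coarse shape with the original sample while its fine detail comes from a different one, which is the kind of contrast a counterfactual objective could exploit.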
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet to 10 Target Datasets (Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) (test) | ImageNet Accuracy | 72.12 | 52 |
| Image Classification | ImageNet Robustness Variants (Adversarial, Rendition, Sketch) V2 (test) | Accuracy (ID) | 72.12 | 14 |
| Few-shot Image Classification | 11-Dataset Average (Base) | Accuracy | 85.9 | 13 |
| Few-shot Image Classification | ImageNet Base | Accuracy | 78.05 | 13 |
| Generalized Zero-shot Image Classification | 11-Dataset Average Generalized | Harmonic Mean | 81.51 | 13 |
| Generalized Zero-shot Image Classification | ImageNet Generalized | Harmonic Mean | 74.63 | 13 |
| Zero-shot Image Classification | 11-Dataset Average (Novel Split) | Zero-shot Average Accuracy | 77.55 | 13 |
| Zero-shot Image Classification | ImageNet (Novel Split) | Accuracy | 71.5 | 13 |
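
The harmonic-mean rows can be cross-checked against the base and novel accuracies in the same table. The sketch below assumes the usual base-to-novel convention, in which the harmonic mean is taken over base-class and novel-class accuracy; the function name `harmonic_mean` is illustrative.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (in percent)."""
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# 11-dataset average: base 85.9, novel 77.55
print(round(harmonic_mean(85.9, 77.55), 2))   # 81.51

# ImageNet: base 78.05, novel 71.5
print(round(harmonic_mean(78.05, 71.5), 2))   # 74.63
```

Both values agree with the reported harmonic means of 81.51 and 74.63.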