SpecPL: Disentangling Spectral Granularity for Prompt Learning

About

Existing prompt learning for VLMs exhibits a modality asymmetry: it predominantly optimizes text tokens while still relying on a frozen visual encoder as a holistic extractor, neglecting the spectral granularity essential for fine-grained discrimination. To bridge this, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. SpecPL also serves as a universal plug-and-play booster, revitalizing text-oriented baselines such as CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate competitive state-of-the-art performance, reaching a harmonic-mean accuracy of 81.51%. These results validate that spectral disentanglement with counterfactual supervision effectively improves the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.
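To make the idea of permuting high-frequency signals concrete, here is a minimal sketch on 1-D toy signals. It is not the authors' implementation: the abstract describes a frozen VAE for the decomposition, whereas this sketch swaps in a simple box blur as the low-pass filter and treats the residual as the high-frequency band. The names `box_blur`, `split_bands`, and `counterfactual_batch` are hypothetical and chosen only for illustration.

```python
import random


def box_blur(signal, radius=1):
    """Low-pass filter: average each sample with its neighbours (window shrinks at edges)."""
    n = len(signal)
    out = []
    for i in range(n):
        lo, hi = max(0, i - radius), min(n, i + radius + 1)
        window = signal[lo:hi]
        out.append(sum(window) / len(window))
    return out


def split_bands(signal, radius=1):
    """Return (low, high) such that low + high reconstructs the signal elementwise."""
    low = box_blur(signal, radius)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


def counterfactual_batch(batch, rng, radius=1):
    """Keep each sample's low band but attach another sample's high band.

    Assumption: a cyclic shift stands in for the permutation, so that no
    sample keeps its own high-frequency detail.
    """
    if len(batch) < 2:
        raise ValueError("need at least two samples to swap high-frequency bands")
    bands = [split_bands(x, radius) for x in batch]
    order = list(range(len(batch)))
    shift = rng.randrange(1, len(batch))  # non-zero offset
    donors = order[shift:] + order[:shift]
    mixed = []
    for i, j in enumerate(donors):
        low_i, high_j = bands[i][0], bands[j][1]
        mixed.append([l + h for l, h in zip(low_i, high_j)])
    return mixed, donors
```

Each entry of `mixed` keeps the coarse content of sample `i` while carrying the fine detail of sample `donors[i]`, which is the kind of counterfactual input a model would need to tell apart from the original.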

Jingtao Zhou, Xirui Kang, Feiyang Huang, Lai-Man Po • 2026

Related benchmarks

Task | Dataset | Result | Rank
Image Classification | ImageNet to 10 Target Datasets (Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) (test) | ImageNet Accuracy: 72.12 | 52
Image Classification | ImageNet Robustness Variants (Adversarial, Rendition, Sketch) V2 (test) | Accuracy (ID): 72.12 | 14
Few-shot Image Classification | 11-Dataset Average (Base) | Accuracy: 85.9 | 13
Few-shot Image Classification | ImageNet Base | Accuracy: 78.05 | 13
Generalized Zero-shot Image Classification | 11-Dataset Average Generalized | Harmonic Mean: 81.51 | 13
Generalized Zero-shot Image Classification | ImageNet Generalized | Harmonic Mean: 74.63 | 13
Zero-shot Image Classification | 11-Dataset Average (Novel Split) | Zero-shot Average Accuracy: 77.55 | 13
Zero-shot Image Classification | ImageNet (Novel Split) | Accuracy: 71.5 | 13
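
The harmonic-mean results listed above can be checked against the base and novel accuracies. The short sketch below assumes the usual base-to-novel convention, in which the harmonic mean is computed from the base-class and novel-class accuracies; the page does not state this explicitly, but the reported numbers are consistent with it. The function name `harmonic_mean` is chosen here for illustration.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# Accuracies copied from the benchmark rows above.
avg_h = harmonic_mean(85.9, 77.55)        # 11-dataset average: base, novel
imagenet_h = harmonic_mean(78.05, 71.5)   # ImageNet: base, novel

print(f"{avg_h:.2f} {imagenet_h:.2f}")  # prints "81.51 74.63"
```

Because the harmonic mean is pulled toward the smaller of the two values, a method cannot reach a high score by trading novel-class accuracy for base-class accuracy.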