SpecPL: Disentangling Spectral Granularity for Prompt Learning

About

Existing prompt learning for VLMs exhibits a modality asymmetry: it predominantly optimizes text tokens while still relying on a frozen visual encoder as a holistic extractor, neglecting the spectral granularity essential for fine-grained discrimination. To bridge this gap, we introduce Disentangling Spectral Granularity for Prompt Learning (SpecPL), which approaches prompt learning from a novel spectral perspective via Counterfactual Granule Supervision. Specifically, we leverage a frozen VAE to decompose visual signals into semantic low-frequency bands and granular high-frequency details. A frozen Visual Semantic Bank anchors text representations to universal low-frequency invariants, mitigating overfitting. Crucially, fine-grained discrimination is driven by counterfactual granule training: by permuting high-frequency signals, we compel the model to explicitly distinguish visual granularity from semantic invariance. Uniquely, SpecPL serves as a universal plug-and-play booster, revitalizing text-oriented baselines such as CoOp and MaPLe via visual-side guidance. Experiments on 11 benchmarks demonstrate state-of-the-art performance, reaching a new performance ceiling of 81.51% harmonic-mean accuracy. These results validate that spectral disentanglement with counterfactual supervision effectively bridges the gap in the stability-generalization trade-off. Code is released at https://github.com/Mlrac1e/SpecPL-Prompt-Learning.
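To make the idea of splitting an image into low- and high-frequency parts and then permuting the high-frequency part more concrete, here is a minimal toy sketch. It is not the authors' method: the abstract says SpecPL uses a frozen VAE for the decomposition, whereas this sketch swaps in a plain FFT low-pass mask as a stand-in. The function names `split_frequencies` and `counterfactual_batch`, the `radius` parameter, and the grayscale `(B, H, W)` input layout are all assumptions made for illustration.

```python
import numpy as np


def split_frequencies(images, radius):
    """Split a batch of grayscale images (B, H, W) into low/high-frequency parts.

    Assumption: an ideal circular low-pass mask in the FFT domain stands in
    for the frozen-VAE decomposition mentioned in the abstract.
    """
    _, h, w = images.shape
    yy, xx = np.ogrid[:h, :w]
    # Distance of each frequency bin from the centre of the shifted spectrum.
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    mask = dist <= radius
    spectrum = np.fft.fftshift(np.fft.fft2(images), axes=(-2, -1))
    low = np.fft.ifft2(np.fft.ifftshift(spectrum * mask, axes=(-2, -1))).real
    high = images - low  # residual detail, so low + high reconstructs the input
    return low, high


def counterfactual_batch(images, radius, rng):
    """Keep each image's low-frequency part but attach another image's detail."""
    low, high = split_frequencies(images, radius)
    perm = rng.permutation(len(images))  # hypothetical in-batch shuffling scheme
    return low + high[perm], perm


rng = np.random.default_rng(0)
batch = rng.random((4, 16, 16))
mixed, perm = counterfactual_batch(batch, radius=3, rng=rng)
print(mixed.shape, perm)
```

In this toy version, each mixed sample shares its coarse content with one image and its fine detail with another. That is the kind of contrast the abstract describes the model being trained to tell apart. How SpecPL actually builds and supervises these samples is defined in the paper and the released code, not here.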

Jingtao Zhou, Xirui Kang, Feiyang Huang, Lai-Man Po • 2026

Related benchmarks

Task	Dataset	Metric	Result (%)	
Image Classification	ImageNet to 10 Target Datasets (Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) (test)	ImageNet Accuracy	72.12	52
Image Classification	ImageNet Robustness Variants (Adversarial, Rendition, Sketch) V2 (test)	Accuracy (ID)	72.12	14
Few-shot Image Classification	11-Dataset Average (Base)	Accuracy	85.9	13
Few-shot Image Classification	ImageNet Base	Accuracy	78.05	13
Generalized Zero-shot Image Classification	11-Dataset Average Generalized	Harmonic Mean	81.51	13
Generalized Zero-shot Image Classification	ImageNet Generalized	Harmonic Mean	74.63	13
Zero-shot Image Classification	11-Dataset Average (Novel Split)	Zero-shot Average Accuracy	77.55	13
Zero-shot Image Classification	ImageNet (Novel Split)	Accuracy	71.5	13
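
The Harmonic Mean rows in the table are consistent with the usual base-to-novel convention, H = 2 · base · novel / (base + novel), applied to the Base and Novel Split accuracies listed in the same table. The short check below reproduces both values; the function name `harmonic_mean` is only an illustrative choice.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of base-class and novel-class accuracy (both in percent)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


# 11-dataset average: Base 85.9, Novel 77.55
print(round(harmonic_mean(85.9, 77.55), 2))   # 81.51
# ImageNet: Base 78.05, Novel 71.5
print(round(harmonic_mean(78.05, 71.5), 2))   # 74.63
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a method cannot score well on it by improving base-class accuracy alone.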
