Hierarchical Cross-modal Prompt Learning for Vision-Language Models

About

Pre-trained Vision-Language Models (VLMs) such as CLIP have shown excellent generalization abilities. However, adapting these large-scale models to downstream tasks while preserving their generalization capabilities remains challenging. Although prompt learning methods have shown promise, they suffer from two fundamental bottlenecks that limit generalization: (a) modality isolation, and (b) hierarchical semantic decay. To address these limitations, we propose HiCroPL, a Hierarchical Cross-modal Prompt Learning framework that establishes bidirectional knowledge flow between text and vision modalities, enabling them to refine their semantics mutually. HiCroPL routes knowledge flows by leveraging the complementary strengths of text and vision. In early layers, text prompts inject relatively clear semantics into visual prompts through a hierarchical knowledge mapper, enhancing the representation of low-level visual semantics. In later layers, visual prompts encoding specific task-relevant objects flow back to refine text prompts, enabling deeper alignment. Crucially, our hierarchical knowledge mapper allows representations at multiple scales to be fused, ensuring that deeper representations retain transferable shallow semantics, thereby enhancing generalization. We further introduce a lightweight layer-specific knowledge proxy to enable efficient cross-modal interactions. Extensive evaluations across four tasks demonstrate HiCroPL's superior performance, achieving state-of-the-art results on 11 benchmarks with significant improvements. Code is available at: https://github.com/zzeoZheng/HiCroPL.

Hao Zheng, Shunzhi Yang, Zhuoxin He, Jinfeng Yang, Zhenhua Huang • 2025
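
The routing described in the abstract (text to vision in early layers, vision back to text in later layers) can be illustrated with a small toy sketch. This is not the authors' implementation; see the linked repository for that. Everything here is an assumption made for illustration: the function names (`cross_attend`, `proxy`, `route`), the `split` parameter, the use of a plain mean as a stand-in for the learned layer-specific knowledge proxy, parameter-free dot-product attention as a stand-in for the hierarchical knowledge mapper, and a shared embedding dimension for both modalities.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attend(queries, sources):
    """Each query attends over the source vectors (scaled dot-product)
    and receives the attention-weighted mix as a residual update."""
    dim = len(queries[0])
    updated = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, s)) / math.sqrt(dim) for s in sources]
        weights = softmax(scores)
        mixed = [sum(w * s[i] for w, s in zip(weights, sources)) for i in range(dim)]
        updated.append([qi + mi for qi, mi in zip(q, mixed)])
    return updated

def proxy(prompts):
    """Compress one layer's prompts into a single token.
    ASSUMPTION: a mean stands in for the learned knowledge proxy."""
    n = len(prompts)
    return [sum(p[i] for p in prompts) / n for i in range(len(prompts[0]))]

def route(text, vision, split):
    """text, vision: per-layer lists of prompt vectors (same dim, an assumption).
    Layers [0, split) send text -> vision; layers [split, L) send vision -> text.
    At layer l the target attends to proxies of source layers 0..l, a
    hypothetical way to fuse shallow and deep semantics."""
    text = [[list(p) for p in layer] for layer in text]
    vision = [[list(p) for p in layer] for layer in vision]
    for l in range(split):
        proxies = [proxy(text[j]) for j in range(l + 1)]
        vision[l] = cross_attend(vision[l], proxies)
    for l in range(split, len(text)):
        proxies = [proxy(vision[j]) for j in range(l + 1)]
        text[l] = cross_attend(text[l], proxies)
    return text, vision
```

Calling `route(text, vision, split=2)` on four layers of prompts would leave the text prompts of layers 0 and 1 and the visual prompts of layers 2 and 3 unchanged, while updating the other two groups. Attending to one proxy token per layer, rather than to every prompt, keeps the attention cost small, which is one plausible reading of why the abstract calls the proxy lightweight.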

Related benchmarks

Task	Dataset	Result
Image Classification	Food101	--	457
Image Classification	Average 11 datasets	Base Accuracy 85.89	95
Image Classification	Caltech101	Base Accuracy 98.77	68
Fine-grained Image Classification	FGVC Aircraft	Accuracy (All) 48.38	50
Satellite Image Classification	EuroSAT	Base Score 96.29	47
Action Recognition	UCF101	Base Accuracy 87.95	34
Image Classification	ImageNet OOD Variants (-V2, -Sketch, -A, -R)	Acc (V2) 64.33	34
Texture Classification	DTD	Base Accuracy 85.07	27
Fine-grained Image Classification	Stanford Cars	Base Accuracy 81.51	27
Fine-grained Image Classification	Oxford Pets	Base Score 96.28	20

Showing 10 of 13 rows
