Learning Hierarchical Prompt with Structured Linguistic Knowledge for Vision-Language Models

About

Prompt learning has become a prevalent strategy for adapting vision-language foundation models to downstream tasks. As large language models (LLMs) have emerged, recent studies have explored the use of category-related descriptions as input to enhance prompt effectiveness. Nevertheless, conventional descriptions lack structured information that effectively represents the interconnections among entities or attributes linked to a particular category. To address this limitation and prioritize harnessing structured knowledge, this paper advocates for leveraging LLMs to build a graph for each description to model the entities and attributes describing the category, as well as their correlations. Existing prompt tuning methods exhibit inadequacies in managing this structured knowledge. Consequently, we propose a novel approach called Hierarchical Prompt Tuning (HPT), which enables simultaneous modeling of both structured and conventional linguistic knowledge. Specifically, we introduce a relationship-guided attention module to capture pair-wise associations among entities and attributes for low-level prompt learning. In addition, by incorporating high-level and global-level prompts modeling overall semantics, the proposed hierarchical structure forges cross-level interlinks and empowers the model to handle more complex and long-term relationships. Extensive experiments demonstrate that our HPT shows strong effectiveness and generalizes much better than existing SOTA methods. Our code is available at https://github.com/Vill-Lab/2024-AAAI-HPT.

Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, Cairong Zhao • 2023
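
The abstract describes two ideas that a small example can make concrete: a per-description graph of entities, attributes, and their relations, and attention that is guided by those pairwise relations. The sketch below is an illustration only, not the authors' implementation (see the linked repository for that). The example graph, the function names `relation_mask` and `relation_guided_weights`, and the additive `bias` scheme are all assumptions made for this sketch.

```python
import math

# Hypothetical structured description for one category (illustrative data,
# not taken from the paper or its code release).
graph = {
    "entities": ["water lily", "leaves", "pond"],
    "attributes": ["round", "floating"],
    # (head, relation, tail) triples linking entities and attributes
    "relations": [
        ("water lily", "has", "leaves"),
        ("leaves", "is", "round"),
        ("leaves", "is", "floating"),
        ("water lily", "grows in", "pond"),
    ],
}


def relation_mask(graph):
    """Return the node list and a symmetric 0/1 matrix marking related pairs."""
    nodes = graph["entities"] + graph["attributes"]
    index = {name: i for i, name in enumerate(nodes)}
    mask = [[0] * len(nodes) for _ in nodes]
    for head, _, tail in graph["relations"]:
        i, j = index[head], index[tail]
        mask[i][j] = mask[j][i] = 1
    return nodes, mask


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def relation_guided_weights(scores, mask, bias=1.0):
    """Add `bias` to the logits of related pairs, then normalize each row.

    Assumption: an additive bias is one simple way to let relations steer
    attention; the paper's module may differ.
    """
    return [
        softmax([s + bias * m for s, m in zip(score_row, mask_row)])
        for score_row, mask_row in zip(scores, mask)
    ]


nodes, mask = relation_mask(graph)
# Start from uniform logits so only the relation bias shapes the weights.
uniform_scores = [[0.0] * len(nodes) for _ in nodes]
weights = relation_guided_weights(uniform_scores, mask, bias=2.0)
```

With uniform starting scores, each node attends more strongly to the nodes it is related to in the graph (for example, "leaves" weights "round" above "pond"), which is the qualitative behavior the abstract attributes to relationship-guided attention.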

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Texture Classification | DTD | Accuracy 83.84 | 108 |
| Action Recognition | UCF101 | Base Accuracy 86.52 | 62 |
| Image Classification | ImageNet to 10 Target Datasets (Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) (test) | ImageNet Accuracy 71.72 | 48 |
| Image Classification | ImageNet Domain Generalization OOD Variants (test) | ImageNet Acc 71.72 | 43 |
| Fine grained classification | FGVCAircraft Base-to-New | Base Accuracy 42.68 | 23 |
| Scene recognition | SUN397 base-to-new | Base Accuracy 82.57 | 16 |
| Fine grained classification | Flower102 | Top-1 Acc 98.17 | 13 |
| Fine grained classification | Food101 | Base Accuracy 90.46 | 13 |
| Fine grained classification | OxfordPets | Base Accuracy 95.78 | 4 |
| Fine grained classification | StanfordCars | Base Accuracy 76.95 | 4 |
Showing 10 of 12 rows
