
Auxiliary Descriptive Knowledge for Few-Shot Adaptation of Vision-Language Model

About

Despite the impressive zero-shot capabilities of Vision-Language Models (VLMs), they often struggle on downstream tasks whose distributions shift from the pre-training data. Few-Shot Adaptation (FSA-VLM) has emerged as a key solution, typically using Parameter-Efficient Fine-Tuning (PEFT) to adapt models with minimal data. However, these PEFT methods are constrained by their reliance on fixed, handcrafted prompts, which often fail to capture class semantics. While some studies have proposed leveraging image-induced prompts to provide additional clues for classification, they introduce prohibitive computational overhead at inference. We therefore introduce Auxiliary Descriptive Knowledge (ADK), a novel framework that enriches text representations without compromising inference efficiency. ADK first leverages a Large Language Model to generate a rich set of descriptive prompts for each class offline. These pre-computed features are then deployed in two ways: (1) as Compositional Knowledge, an averaged representation that provides rich semantics, especially beneficial when class names are ambiguous or unfamiliar to the VLM; and (2) as Instance-Specific Knowledge, where a lightweight, non-parametric attention mechanism dynamically selects the descriptions most relevant to a given image. This approach supplies two additional types of knowledge alongside the handcrafted prompt, thereby facilitating category distinction across various domains. Moreover, ADK acts as a parameter-free, plug-and-play component that enhances existing PEFT methods. Extensive experiments demonstrate that ADK consistently boosts the performance of multiple PEFT baselines, setting a new state-of-the-art across various scenarios.
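The abstract describes two knowledge sources built from pre-computed LLM description embeddings: a per-class average (Compositional Knowledge) and a similarity-weighted, non-parametric attention over descriptions (Instance-Specific Knowledge), combined with the handcrafted-prompt embedding. The paper's exact fusion rule is not given here, so the following is only a minimal sketch under stated assumptions: random stand-in features instead of real CLIP embeddings, an assumed temperature of 0.01, and a simple uniform average of the three class scores.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along the given axis."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def adk_logits(image_feat, handcrafted_feats, desc_feats, temperature=0.01):
    """Sketch of ADK-style scoring (assumed fusion, not the paper's exact rule).

    image_feat:        (d,)      image embedding
    handcrafted_feats: (C, d)    one handcrafted-prompt embedding per class
    desc_feats:        (C, M, d) M pre-computed LLM-description embeddings per class
    Returns per-class scores in [-1, 1] combining three knowledge sources.
    """
    img = l2_normalize(image_feat)
    hand = l2_normalize(handcrafted_feats)

    # (1) Compositional Knowledge: average the description embeddings per class.
    comp = l2_normalize(desc_feats.mean(axis=1))            # (C, d)

    # (2) Instance-Specific Knowledge: non-parametric attention — weight each
    # description by its cosine similarity to the image, then aggregate.
    d = l2_normalize(desc_feats)                            # (C, M, d)
    sim = d @ img                                           # (C, M)
    attn = np.exp(sim / temperature)
    attn /= attn.sum(axis=1, keepdims=True)
    inst = l2_normalize((attn[..., None] * d).sum(axis=1))  # (C, d)

    # Combine handcrafted, compositional, and instance-specific similarities.
    return (hand @ img + comp @ img + inst @ img) / 3.0     # (C,)
```

Because the description features are computed offline, inference adds only one cosine-similarity pass and a softmax over M descriptions per class, avoiding the per-image prompt generation that the abstract flags as prohibitively expensive.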

SuBeen Lee, GilHan Park, WonJun Moon, Hyun Seok Seong, Jae-Pil Heo · 2025

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Image Classification | 11 Downstream Classification Datasets (ImageNet, Flowers102, DTD, OxfordPets, StanfordCars, UCF101, Caltech101, Food101, SUN397, FGVC-Aircraft, EuroSAT), standard (test) | DTD Accuracy: 75 | 115 |
| Image Classification | Average over 11 datasets | -- | 52 |
| Few-shot Image Classification | All-to-all setting | Accuracy: 84 | 31 |
| Image Classification | Average across 10 datasets | Average Accuracy: 67.4 | 14 |
| Image Classification | 11-dataset VLM Generalization Suite, base-to-novel | Avg Acc: 81.6 | 13 |
| Base-to-novel Classification | Base-to-novel Evaluation Suite (novel classes) | Avg Accuracy: 73.9 | 8 |
| Base-to-novel Classification | Base-to-novel Evaluation Suite, ViT-B/32 backbone (base classes) | Average Accuracy: 82.6 | 8 |
| Base-to-novel Classification | Base-to-novel Evaluation Suite, ViT-B/32 backbone (harmonic mean) | Average Score: 78 | 8 |
| Image Classification | Average over 11 datasets (novel) | Top-1 Accuracy (Avg): 82.7 | 8 |
| Image Classification | 11 image recognition datasets (base classes) | Average Accuracy: 89.2 | 8 |

Showing 10 of 20 rows.
