LPT: Less-overfitting Prompt Tuning for Vision-Language Model

About

Vision-language models (VLMs) have demonstrated exceptional generalization capabilities on downstream tasks. Prompt learning has gradually become a more effective and efficient way to transfer VLMs to downstream tasks than traditional fine-tuning. However, during the transfer process these models are prone to severe overfitting, which leads to a significant decline in generalization ability. To address this issue, we propose LPT, a framework designed specifically for vision-language models. We use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. To further mitigate overfitting, we introduce a Structural Preservation (SP) constraint at the feature level, which aligns the overall structure of the model's feature space with that of the frozen CLIP, endowing the feature space with overall plasticity and enabling it to be reshaped effectively during optimization. In addition, we employ a Hierarchical Logit (HL) constraint at the output layer to constrain the overall class information in the output, complementing SP at the output end. Extensive experiments across various benchmarks (base-to-novel generalization, cross-dataset transfer, and domain generalization) demonstrate that our approach significantly improves generalization capability and effectively alleviates overfitting compared to state-of-the-art methods.

Chenhao Ding, Xinyuan Gao, Songlin Dong, Jizhou Han, Qiang Wang, Zhengdong Zhou, Yuhang He, Yihong Gong • 2024
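
The abstract describes the Structural Preservation (SP) constraint only at a high level: the structure of the tuned feature space is aligned with that of the frozen CLIP. The sketch below illustrates one plausible reading of that idea, not the paper's actual formulation. It assumes "structure" means the matrix of pairwise cosine similarities within a batch of features, and it penalizes the mean squared difference between the tuned and frozen matrices. The function names and the toy feature vectors are hypothetical.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity_matrix(features):
    """Pairwise cosine similarities between all feature vectors in a batch."""
    unit = [l2_normalize(f) for f in features]
    return [[sum(a * b for a, b in zip(u, w)) for w in unit] for u in unit]


def structure_preservation_loss(tuned_features, frozen_features):
    """Mean squared difference between the two pairwise-similarity matrices.

    Assumption: this is a generic relational-alignment penalty chosen for
    illustration; the loss used in LPT may differ.
    """
    s_tuned = cosine_similarity_matrix(tuned_features)
    s_frozen = cosine_similarity_matrix(frozen_features)
    n = len(s_tuned)
    total = sum(
        (s_tuned[i][j] - s_frozen[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


# Toy 2-D "features" standing in for frozen CLIP embeddings (hypothetical data).
frozen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# The same points rotated by 90 degrees: the relational structure is unchanged.
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]
# Points squeezed toward one direction: the relational structure is distorted.
collapsed = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05]]

print(structure_preservation_loss(rotated, frozen))
print(structure_preservation_loss(collapsed, frozen))
```

Under this reading, the penalty constrains only the relations between features, so the tuned features can move as a whole (here, a rotation) at no cost, while a collapse of the feature space is penalized. That is one way to interpret the abstract's claim that the feature space keeps "overall plasticity".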

Related benchmarks

Task	Dataset	Metric	Value	
Image Classification	ImageNet source to 10 fine-grained target datasets (test)	Caltech101 Accuracy	95	37
Image Classification	Food101 novel classes	Accuracy	0.9167	36
Image Classification	11 image recognition datasets (Base classes)	Average Accuracy	85.1	30
Image Classification	DTD (Novel)	Top-1 Acc	65.4	21
Image Classification	Flowers102 (Novel)	Top-1 Accuracy	77.9	15
Image Classification	SUN397 (Novel)	Top-1 Acc	79.5	15
Image Classification	UCF101 (Novel)	Top-1 Acc	80.5	15
Image Classification	OxfordPets (Novel)	Top-1 Accuracy	97.87	15
Image Classification	Caltech101 (Novel)	Top-1 Acc	94.3	15
Image Classification	FGVC Aircraft Novel	Accuracy	38.8	11

Showing 10 of 13 rows
