E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning
About
As transformer-based models continue to grow in size, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning methods have been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, a significant performance gap remains compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into the self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure that systematically prunes low-importance prompts while preserving model performance, substantially enhancing the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks while using considerably fewer parameters (e.g., 0.32% of model parameters on VTAB-1k). Our code is available at https://github.com/ChengHan111/E2VPT.
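To make the two mechanisms above concrete, here is a minimal NumPy sketch of (1) prepending learnable visual prompts to the patch-token sequence, (2) concatenating learnable key-value prompts to the keys and values inside self-attention, and (3) pruning prompts by an importance score. All shapes, weights, and the L2-norm importance score are illustrative assumptions, not the paper's actual implementation (which learns the prompts by backpropagation and uses a learned pruning criterion).

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16          # embedding dimension (assumed for illustration)
n_tokens = 8    # patch tokens from the image
n_vp = 4        # number of visual prompts prepended to the input
n_kv = 2        # number of key-value prompts inserted in attention

# (1) Visual prompts: learnable vectors prepended to the patch tokens.
patch_tokens = rng.standard_normal((n_tokens, d))
visual_prompts = rng.standard_normal((n_vp, d))       # would be nn.Parameter in practice
x = np.concatenate([visual_prompts, patch_tokens])    # (n_vp + n_tokens, d)

# (2) Key-value prompts: learnable vectors concatenated to K and V
# inside a (single-head, toy) self-attention layer.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
p_k = rng.standard_normal((n_kv, d))                  # learnable key prompts
p_v = rng.standard_normal((n_kv, d))                  # learnable value prompts
k = np.concatenate([p_k, k])                          # (n_kv + n_vp + n_tokens, d)
v = np.concatenate([p_v, v])

scores = q @ k.T / np.sqrt(d)                         # queries attend to prompts too
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                                        # (n_vp + n_tokens, d)

# (3) Prompt pruning: rank prompts by an importance score and drop the
# least important. The L2 norm used here is a stand-in for the learned
# importance used in the paper.
importance = np.linalg.norm(visual_prompts, axis=1)
keep = np.argsort(importance)[::-1][: n_vp // 2]      # keep the top half
pruned_prompts = visual_prompts[keep]                 # (n_vp // 2, d)
```

Note that the key-value prompts add columns only to K and V, so the output sequence length is unchanged; the visual prompts, by contrast, lengthen the input sequence itself.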
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K (val) | -- | -- | 2731 |
| Image Classification | VTAB 1K | Overall Mean Accuracy | 73.94 | 204 |
| Multi-Label Classification | NUS-WIDE (test) | mAP | 67.9 | 112 |
| Multi-Label Classification | MS-COCO 2014 (test) | mAP | 89.6 | 81 |
| Visual Task Adaptation | VTAB 1K | Average Accuracy | 73.94 | 78 |
| Fine-grained Image Classification | CUB-200-2011 (test) | Consistency Score | 27.5 | 65 |
| Multi-Label Classification | VOC 07 | mAP | 96.1 | 61 |
| Fine-grained Visual Categorization | FGVC | Mean Accuracy | 89.22 | 40 |
| Image Classification | FGVC | Accuracy | 89.22 | 38 |
| Multi-Label Classification | Visual Genome VG256 (test) | mAP | 49.2 | 24 |