E^2VPT: An Effective and Efficient Approach for Visual Prompt Tuning
About
As transformer-based models continue to grow in size, fine-tuning these large-scale pretrained vision models for new tasks has become increasingly parameter-intensive. Parameter-efficient learning methods have been developed to reduce the number of tunable parameters during fine-tuning. Although these methods show promising results, a significant performance gap remains compared to full fine-tuning. To address this challenge, we propose an Effective and Efficient Visual Prompt Tuning (E^2VPT) approach for large-scale transformer-based model adaptation. Specifically, we introduce a set of learnable key-value prompts and visual prompts into the self-attention and input layers, respectively, to improve the effectiveness of model fine-tuning. Moreover, we design a prompt pruning procedure that systematically prunes low-importance prompts while preserving model performance, substantially enhancing the model's efficiency. Empirical results demonstrate that our approach outperforms several state-of-the-art baselines on two benchmarks while using considerably fewer parameters (e.g., 0.32% of model parameters on VTAB-1k). Our code is available at https://github.com/ChengHan111/E2VPT.
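To make the two mechanisms above concrete, here is a minimal NumPy sketch of (1) prepending learnable visual prompts to the patch-token sequence, (2) concatenating learnable key-value prompts to the keys and values inside self-attention, and (3) pruning prompts by an importance score. All shapes, weights, and the L2-norm importance score are illustrative assumptions, not the paper's actual implementation (which learns the prompts by backpropagation and uses a learned pruning criterion).

```python
import numpy as np

rng = np.random.default_rng(0)

d = 16          # embedding dimension (assumed for illustration)
n_tokens = 8    # patch tokens from the image
n_vp = 4        # number of visual prompts prepended to the input
n_kv = 2        # number of key-value prompts inserted in attention

# (1) Visual prompts: learnable vectors prepended to the patch tokens.
patch_tokens = rng.standard_normal((n_tokens, d))
visual_prompts = rng.standard_normal((n_vp, d))       # would be nn.Parameter in practice
x = np.concatenate([visual_prompts, patch_tokens])    # (n_vp + n_tokens, d)

# (2) Key-value prompts: learnable vectors concatenated to K and V
# inside a (single-head, toy) self-attention layer.
W_q, W_k, W_v = (rng.standard_normal((d, d)) for _ in range(3))
q, k, v = x @ W_q, x @ W_k, x @ W_v
p_k = rng.standard_normal((n_kv, d))                  # learnable key prompts
p_v = rng.standard_normal((n_kv, d))                  # learnable value prompts
k = np.concatenate([p_k, k])                          # (n_kv + n_vp + n_tokens, d)
v = np.concatenate([p_v, v])

scores = q @ k.T / np.sqrt(d)                         # queries attend to prompts too
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)
out = attn @ v                                        # (n_vp + n_tokens, d)

# (3) Prompt pruning: rank prompts by an importance score and drop the
# least important. The L2 norm used here is a stand-in for the learned
# importance used in the paper.
importance = np.linalg.norm(visual_prompts, axis=1)
keep = np.argsort(importance)[::-1][: n_vp // 2]      # keep the top half
pruned_prompts = visual_prompts[keep]                 # (n_vp // 2, d)
```

Note that the key-value prompts add columns only to K and V, so the output sequence length is unchanged; the visual prompts, by contrast, lengthen the input sequence itself.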
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic Segmentation | ADE20K (val) | -- | -- | 2731 |
| Image Classification | VTAB 1K | Overall Mean Accuracy | 73.94 | 204 |
| Multi-Label Classification | NUS-WIDE (test) | mAP | 67.9 | 112 |
| Multi-Label Classification | MS-COCO 2014 (test) | mAP | 89.6 | 81 |
| Visual Task Adaptation | VTAB 1K | Average Accuracy | 73.94 | 78 |
| Fine-grained Image Classification | CUB-200-2011 (test) | Consistency Score | 27.5 | 65 |
| Multi-Label Classification | VOC 07 | mAP | 96.1 | 61 |
| Fine-grained Visual Categorization | FGVC | Mean Accuracy | 89.22 | 40 |
| Image Classification | FGVC | Accuracy | 89.22 | 38 |
| Multi-Label Classification | Visual Genome VG256 (test) | mAP | 49.2 | 24 |