Visual Prompt Tuning
About
The current modus operandi for adapting pre-trained models is to update all of the backbone parameters, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small number of trainable parameters in the input space (less than 1% of the model's parameters) while keeping the backbone frozen. Through extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains over other parameter-efficient tuning protocols. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while reducing per-task storage cost.
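The core idea can be illustrated with a minimal sketch: learnable prompt tokens are prepended to the embedded patch sequence, and only those tokens are trained while the backbone stays frozen. The sizes below (ViT-Base/16, 50 prompts) are assumptions chosen for illustration, not values taken from the paper.

```python
# Minimal, illustrative sketch of VPT's input-space modification
# (not the authors' implementation; all sizes are assumptions).
embed_dim = 768        # hidden size of ViT-Base (assumed backbone)
num_patches = 196      # 14x14 patches for a 224x224 image
num_prompts = 50       # learnable prompt tokens -- the only new parameters

# Represent each token as a vector of embed_dim floats.
cls_token = [[0.0] * embed_dim]                                 # frozen
patch_tokens = [[0.0] * embed_dim for _ in range(num_patches)]  # frozen
prompts = [[0.0] * embed_dim for _ in range(num_prompts)]       # trainable

# VPT prepends the prompts between [CLS] and the patch tokens; the
# frozen Transformer then processes the extended sequence unchanged.
sequence = cls_token + prompts + patch_tokens
print(len(sequence))   # 1 + 50 + 196 = 247 tokens

# Only the prompts are updated, so the trainable fraction is tiny:
backbone_params = 86_000_000          # approx. ViT-Base size (assumption)
trainable = num_prompts * embed_dim   # 50 * 768 = 38,400 parameters
print(f"trainable fraction: {trainable / backbone_params:.3%}")
```

With these assumed sizes, the prompts amount to roughly 0.045% of the backbone's parameters, consistent with the "less than 1%" figure above; per task, only the prompts (and a classification head) need to be stored.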
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Semantic segmentation | ADE20K (val) | mIoU | 49.9 | 2731 |
| Image Classification | ImageNet-1K 1.0 (val) | Top-1 Accuracy | 83.58 | 1866 |
| Mathematical Reasoning | GSM8K | Accuracy | 75.66 | 983 |
| Image Classification | ImageNet 1k (test) | Top-1 Accuracy | 81.68 | 798 |
| Image Super-resolution | Manga109 | PSNR | 23.98 | 656 |
| Image Classification | ImageNet A | Top-1 Accuracy | 35.17 | 553 |
| Image Super-resolution | Set5 (test) | PSNR | 32.71 | 544 |
| Image Classification | EuroSAT | Accuracy | 62.24 | 497 |
| Image Classification | Food-101 | -- | -- | 494 |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 68.51 | 487 |