Skip Tuning: Pre-trained Vision-Language Models are Effective and Efficient Adapters Themselves

About

Prompt tuning (PT) has long been recognized as an effective and efficient paradigm for transferring large pre-trained vision-language models (VLMs) to downstream tasks by learning a tiny set of context vectors. Nevertheless, in this work, we reveal that freezing the parameters of VLMs during learning the context vectors neither facilitates the transferability of pre-trained knowledge nor improves the memory and time efficiency significantly. Upon further investigation, we find that reducing both the length and width of the feature-gradient propagation flows of the full fine-tuning (FT) baseline is key to achieving effective and efficient knowledge transfer. Motivated by this, we propose Skip Tuning, a novel paradigm for adapting VLMs to downstream tasks. Unlike existing PT or adapter-based methods, Skip Tuning applies Layer-wise Skipping (LSkip) and Class-wise Skipping (CSkip) upon the FT baseline without introducing extra context vectors or adapter modules. Extensive experiments across a wide spectrum of benchmarks demonstrate the superior effectiveness and efficiency of our Skip Tuning over both PT and adapter-based methods. Code: https://github.com/Koorye/SkipTuning.

Shihan Wu, Ji Zhang, Pengpeng Zeng, Lianli Gao, Jingkuan Song, Heng Tao Shen• 2024

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy87.43	660
Image Classification	DTD	Accuracy74.07	599
Image Classification	Food-101	Accuracy87.13	570
Image Classification	EuroSAT	Accuracy91.57	569
Image Classification	Flowers102	--	558
Image Classification	UCF101	Top-1 Acc87.7	527
Image Classification	Food101	--	457
Image Classification	SUN397	Accuracy76.8	425
Image Classification	StanfordCars	--	384
Image Classification	Oxford-IIIT Pets	Accuracy94	378

Showing 10 of 54 rows

Other info

Code

Follow for update

@wizwand_team Discord