AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation

About

Pre-trained vision-language models (VLMs) have shown impressive results in various visual classification tasks. However, we often fail to fully unleash their potential when adapting them for new concept understanding due to limited information on new classes. To address this limitation, we introduce a novel adaptation framework, AWT (Augment, Weight, then Transport). AWT comprises three key components: augmenting inputs with diverse visual perspectives and enriched class descriptions through image transformations and language models; dynamically weighting inputs based on the prediction entropy; and employing optimal transport to mine semantic correlations in the vision-language space. AWT can be seamlessly integrated into various VLMs, enhancing their zero-shot capabilities without additional training and facilitating few-shot learning through an integrated multimodal adapter module. We verify AWT in multiple challenging scenarios, including zero-shot and few-shot image classification, zero-shot video action recognition, and out-of-distribution generalization. AWT consistently outperforms the state-of-the-art methods in each setting. In addition, our extensive studies further demonstrate AWT's effectiveness and adaptability across different VLMs, architectures, and scales.

Yuhan Zhu, Yuyang Ji, Zhiyu Zhao, Gangshan Wu, Limin Wang • 2024

Related benchmarks

Task	Dataset	Result
Image Classification	Stanford Cars	Accuracy 87.59	705
Image Classification	DTD	--	610
Image Classification	ImageNet-1K	Top-1 Acc 71.32	600
Image Classification	Food-101	Accuracy 88.11	590
Image Classification	EuroSAT	Accuracy 93.68	569
Image Classification	UCF101	Top-1 Acc 87.53	529
Image Classification	SUN397	Accuracy 77.57	425
Image Classification	StanfordCars	Accuracy 69.93	384
Image Classification	CUB-200 2011	Accuracy 59.54	381
Image Classification	CUB	Accuracy 60.2	351

Showing 10 of 41 rows
