Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models

About

Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for open-world generalization has gained increasing popularity due to its practical value. However, performance gains are limited when relying solely on intricate algorithmic designs for a single model, even one exhibiting strong performance, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., ensembles of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to one specific scenario. First, we introduce the zero-shot ensemble, which automatically adjusts the logits of different models based on their confidence when only pre-trained VLMs are available. Second, for scenarios with extra few-shot samples, we propose the training-free ensemble and the tuning ensemble, offering flexibility based on the availability of computing resources. The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance. Notably, this work represents an initial stride toward enhancing the generalization performance of VLMs via ensembling. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.

Zhihe Lu, Jiawang Bai, Xin Li, Zeyu Xiao, Xinchao Wang • 2023
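
The zero-shot ensemble described in the abstract adjusts each model's logits according to its confidence. The abstract does not give the exact weighting rule, so the following is only a minimal sketch under stated assumptions: confidence is taken to be the maximum softmax probability per sample, and per-sample weights are those confidences normalized across models. The function names `softmax` and `confidence_weighted_ensemble` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - np.max(x, axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=axis, keepdims=True)


def confidence_weighted_ensemble(logits_per_model):
    """Combine logits from several models, weighting each by its confidence.

    logits_per_model: list of arrays, each of shape (num_samples, num_classes).
    Assumption (not from the paper): confidence = max softmax probability,
    and weights are confidences normalized across models for each sample.
    """
    stacked = np.stack(logits_per_model, axis=0)      # (models, samples, classes)
    probs = softmax(stacked, axis=-1)
    confidence = probs.max(axis=-1)                   # (models, samples)
    weights = confidence / confidence.sum(axis=0, keepdims=True)
    # Weighted sum of the raw logits over the model axis.
    return (weights[..., None] * stacked).sum(axis=0)  # (samples, classes)


# Toy logits standing in for a confident model and a less confident one.
strong = np.array([[5.0, 0.0, 0.0]])
weak = np.array([[0.0, 1.0, 0.0]])
combined = confidence_weighted_ensemble([strong, weak])
prediction = combined.argmax(axis=-1)
```

In this toy case the more confident model receives the larger weight, so its preferred class dominates the combined logits. A real setup would replace the toy arrays with image-text similarity logits from each pre-trained VLM.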

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Fine-grained Image Classification | Stanford Cars (test) | Accuracy: 70.76 | 348 |
| Image Classification | ImageNet | Top-1 Accuracy: 73.25 | 324 |
| Action Recognition | UCF101 (test) | Accuracy: 69.84 | 307 |
| Fine-grained visual classification | FGVC-Aircraft (test) | Top-1 Acc: 25.68 | 287 |
| Base-to-New Generalization | ImageNet | Base Accuracy: 78.74 | 67 |
| Image Classification | EuroSAT Base-to-New | Base Score: 95.52 | 65 |
| Base-to-New Generalization | FGVCAircraft | Base Performance: 43.22 | 64 |
| Image Classification | EuroSAT (test) | Accuracy: 50.2 | 59 |
| Image Classification | ImageNet Domain Generalization (Source: ImageNet, Targets: ImageNetV2, ImageNet-Sketch, ImageNet-A, ImageNet-R) (test) | Accuracy (ImageNetV2): 65.73 | 53 |
| Image Classification | ImageNet to 10 Target Datasets (Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) (test) | ImageNet Accuracy: 70.88 | 48 |
Showing 10 of 22 rows
