Beyond Sole Strength: Customized Ensembles for Generalized Vision-Language Models
About
Fine-tuning pre-trained vision-language models (VLMs), e.g., CLIP, for open-world generalization has gained increasing popularity due to its practical value. However, performance gains are limited when relying solely on intricate algorithmic designs for a single model, even one that performs strongly, e.g., CLIP-ViT-B/16. This paper, for the first time, explores the collaborative potential of leveraging much weaker VLMs to enhance the generalization of a robust single model. The affirmative findings motivate us to address the generalization problem from a novel perspective, i.e., an ensemble of pre-trained VLMs. We introduce three customized ensemble strategies, each tailored to one specific scenario. First, we introduce the zero-shot ensemble, which automatically adjusts the logits of different models based on their confidence when only pre-trained VLMs are available. Furthermore, for scenarios with extra few-shot samples, we propose the training-free and tuning ensembles, offering flexibility based on the availability of computing resources. The proposed ensemble strategies are evaluated on zero-shot, base-to-new, and cross-dataset generalization, achieving new state-of-the-art performance. Notably, this work represents an initial stride toward enhancing the generalization performance of VLMs via ensembling. The code is available at https://github.com/zhiheLu/Ensemble_VLM.git.
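The idea of adjusting each model's logits by its confidence can be illustrated with a minimal sketch. The abstract does not say how confidence is measured or how the weights are formed, so everything below is an assumption made for illustration: confidence is taken as the maximum softmax probability per sample, weights are those confidences normalized across models, and the function name `confidence_weighted_ensemble` is hypothetical. This is not the authors' implementation; see the linked repository for that.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def confidence_weighted_ensemble(logits_per_model):
    """Combine logits from several models, weighting each by its confidence.

    logits_per_model: list of arrays, each of shape (num_samples, num_classes).
    ASSUMPTION: confidence = max softmax probability per sample; the paper's
    actual confidence measure may differ.
    """
    stacked = np.stack(logits_per_model, axis=0)       # (M, N, C)
    conf = softmax(stacked, axis=-1).max(axis=-1)      # (M, N)
    weights = conf / conf.sum(axis=0, keepdims=True)   # normalize over models
    return (weights[..., None] * stacked).sum(axis=0)  # (N, C)


# Toy example: an uncertain "strong" model and a confident "weak" model.
strong = np.array([[2.0, 1.9, 0.1]])  # nearly tied between class 0 and 1
weak = np.array([[0.0, 5.0, 0.0]])    # confident in class 1
combined = confidence_weighted_ensemble([strong, weak])
print(combined.argmax(axis=-1))
```

In this toy case the strong model alone narrowly prefers class 0, while the combined logits favor class 1 because the confident model receives the larger per-sample weight. Real VLM logits would come from image-text similarity scores, which this sketch does not compute.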
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Fine-grained Image Classification | Stanford Cars (test) | Accuracy | 70.76 | 348 |
| Image Classification | ImageNet | Top-1 Accuracy | 73.25 | 324 |
| Action Recognition | UCF101 (test) | Accuracy | 69.84 | 307 |
| Fine-grained visual classification | FGVC-Aircraft (test) | Top-1 Acc | 25.68 | 287 |
| Base-to-New Generalization | ImageNet | Base Accuracy | 78.74 | 67 |
| Image Classification | EuroSAT Base-to-New | Base Score | 95.52 | 65 |
| Base-to-New Generalization | FGVCAircraft | Base Performance | 43.22 | 64 |
| Image Classification | EuroSAT (test) | Accuracy | 50.2 | 59 |
| Image Classification | ImageNet Domain Generalization (Source: ImageNet, Targets: ImageNetV2, ImageNet-Sketch, ImageNet-A, ImageNet-R) (test) | Accuracy (ImageNetV2) | 65.73 | 53 |
| Image Classification | ImageNet to 10 Target Datasets (Caltech101, OxfordPets, StanfordCars, Flowers102, Food101, FGVCAircraft, SUN397, DTD, EuroSAT, UCF101) (test) | ImageNet Accuracy | 70.88 | 48 |