SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models
About
Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.
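To make the idea of combining model outputs through a transport plan concrete, here is a minimal sketch in plain Python. It is not the SOTA algorithm: it uses generic entropic optimal transport (Sinkhorn iterations) with equal model weights and a uniform class marginal, both of which are simplifying assumptions that the prior-free, self-adaptive method described above is designed to avoid. The function names `sinkhorn_ensemble` and `predict`, the parameters `eps` and `n_iters`, and the toy probabilities are all hypothetical.

```python
def sinkhorn_ensemble(prob_sets, eps=0.1, n_iters=200):
    """Fuse K models' class probabilities (each an N x C nested list)
    into one N x C transport plan using Sinkhorn iterations.

    Assumptions (not taken from the paper): equal model weights and a
    uniform target mass per class.
    """
    n_models = len(prob_sets)
    n, c = len(prob_sets[0]), len(prob_sets[0][0])
    # Equal-weight average of the models' probabilities.
    mean = [[sum(p[i][j] for p in prob_sets) / n_models for j in range(c)]
            for i in range(n)]
    # Gibbs kernel exp(-cost / eps) with cost = -log(mean probability),
    # which simplifies to mean ** (1 / eps).
    kernel = [[max(mean[i][j], 1e-12) ** (1.0 / eps) for j in range(c)]
              for i in range(n)]
    row_mass, col_mass = 1.0 / n, 1.0 / c
    u, v = [1.0] * n, [1.0] * c
    for _ in range(n_iters):
        # Rescale rows so every sample carries mass 1/N ...
        u = [row_mass / sum(kernel[i][j] * v[j] for j in range(c))
             for i in range(n)]
        # ... then rescale columns so every class receives mass 1/C.
        v = [col_mass / sum(kernel[i][j] * u[i] for i in range(n))
             for j in range(c)]
    return [[u[i] * kernel[i][j] * v[j] for j in range(c)] for i in range(n)]


def predict(plan):
    """Assign each sample to the class with the largest transported mass."""
    return [max(range(len(row)), key=row.__getitem__) for row in plan]


# Toy outputs of two hypothetical models for 4 samples and 2 classes.
model_a = [[0.9, 0.1], [0.8, 0.2], [0.6, 0.4], [0.2, 0.8]]
model_b = [[0.8, 0.2], [0.7, 0.3], [0.55, 0.45], [0.3, 0.7]]

plan = sinkhorn_ensemble([model_a, model_b])
preds = predict(plan)
print(preds)  # [0, 0, 1, 1]
```

Plain averaging followed by argmax would label the third sample as class 0. The balanced transport plan moves it to class 1 because the uniform class marginal forces each class to receive half of the total mass. That behavior is useful only when classes really are balanced, which is exactly the kind of prior a self-adaptive plan would need to avoid.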
Related benchmarks
| Task | Dataset | Accuracy (%) | Rank |
|---|---|---|---|
| Image Classification | EuroSAT | 71.7 | 569 |
| Image Classification | Flowers102 | 85.1 | 558 |
| Image Classification | DTD | 57.5 | 542 |
| Image Classification | Food101 | 90.2 | 457 |
| Image Classification | SUN397 | 73 | 441 |
| Image Classification | RESISC45 | 88.8 | 349 |
| Image Classification | Aircraft | 31.8 | 333 |
| Image Classification | StanfordCars | 78.8 | 312 |
| Image Classification | Pets | 95.4 | 245 |
| Image Classification | Caltech101 | 96.9 | 228 |