
SOTA: Self-adaptive Optimal Transport for Zero-Shot Classification with Multiple Foundation Models

About

Foundation models have attracted widespread attention across domains due to their powerful zero-shot classification capabilities. This work is motivated by two key observations: (1) Vision-Language Models (VLMs), such as CLIP, often over-rely on class-level textual priors and struggle to capture fine-grained visual cues, whereas Vision-only Foundation Models (VFMs), such as DINO, provide rich and discriminative visual features but lack semantic alignment; (2) the performance of different VLMs varies considerably across datasets owing to differences in pre-training. To address these challenges, we propose SOTA (Self-adaptive Optimal TrAnsport), a training-free ensemble framework that integrates the outputs of multiple foundation models (VFMs or VLMs) by learning a self-adaptive transport plan. Notably, SOTA is prior-free and automatically balances model contributions. Extensive experiments across diverse domains, including natural images, medical pathology, and remote sensing, validate the generalizability of SOTA. The results consistently show that it effectively leverages the complementary strengths of different foundation models and achieves substantial improvements over individual models. The implementation code is available at: https://github.com/Afleve/self-adaptive-Optimal-Transport.
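To make the idea of combining several models' outputs through a transport plan more concrete, here is a minimal sketch using generic entropic optimal transport (Sinkhorn iterations). It is not the authors' algorithm: the function names `sinkhorn_plan` and `ensemble_predict`, the equal weighting of models, the uniform class marginal, and the `eps` value are all assumptions made for illustration, and the sketch does not reproduce the self-adaptive, prior-free balancing described in the abstract. For the actual method, see the linked repository.

```python
import numpy as np

def sinkhorn_plan(cost, eps=0.5, n_iters=200):
    """Entropic OT plan between N samples and C classes (generic Sinkhorn)."""
    n, c = cost.shape
    row_marg = np.full(n, 1.0 / n)   # each sample carries equal mass
    col_marg = np.full(c, 1.0 / c)   # ASSUMPTION: uniform class marginal
    kernel = np.exp(-(cost - cost.min()) / eps)
    u = np.ones(n)
    for _ in range(n_iters):
        v = col_marg / (kernel.T @ u)   # rescale columns to match col_marg
        u = row_marg / (kernel @ v)     # rescale rows to match row_marg
    return u[:, None] * kernel * v[None, :]

def ensemble_predict(prob_list, eps=0.5):
    """Fuse per-model class probabilities (each N x C) via one transport plan."""
    # ASSUMPTION: models are weighted equally by averaging log-probabilities.
    mean_log_p = np.mean([np.log(p + 1e-12) for p in prob_list], axis=0)
    plan = sinkhorn_plan(-mean_log_p, eps=eps)
    return plan.argmax(axis=1), plan

# Toy inputs: two hypothetical models, four samples, two classes.
probs_a = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]])
probs_b = np.array([[0.7, 0.3], [0.6, 0.4], [0.4, 0.6], [0.1, 0.9]])
preds, plan = ensemble_predict([probs_a, probs_b])
```

In this sketch the plan's rows sum to 1/N and its columns are driven toward 1/C, so the marginal constraint acts as a soft balancing of predictions across classes; any real use would need to replace the assumed marginals and model weights with whatever the method actually prescribes.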

Zhanxuan Hu, Qiyu Xu, Yu Duan, Yonghang Tai, Huafeng Li • 2025

Related benchmarks

Task                  Dataset       Result         Rank
Image Classification  EuroSAT       Accuracy 71.7  569
Image Classification  Flowers102    Accuracy 85.1  558
Image Classification  DTD           Accuracy 57.5  542
Image Classification  Food101       Accuracy 90.2  457
Image Classification  SUN397        Accuracy 73    441
Image Classification  RESISC45      Accuracy 88.8  349
Image Classification  Aircraft      Accuracy 31.8  333
Image Classification  StanfordCars  Accuracy 78.8  312
Image Classification  Pets          Accuracy 95.4  245
Image Classification  Caltech101    Accuracy 96.9  228
Showing 10 of 27 rows
