Context-Aware Multimodal Pretraining

About

Large-scale multimodal representation learning successfully optimizes for zero-shot transfer at test time. Yet the standard pretraining paradigm (contrastive learning on large amounts of image-text data) does not explicitly encourage representations to support few-shot adaptation. In this work, we propose a simple but carefully designed extension to multimodal pretraining that enables representations to accommodate additional context. Using this objective, we show that vision-language models can be trained to exhibit significantly increased few-shot adaptation: across 21 downstream tasks, we find up to four-fold improvements in test-time sample efficiency, and average few-shot adaptation gains of over 5%, while retaining zero-shot generalization performance across model scales and training durations. In particular, equipped with simple, training-free, metric-based adaptation mechanisms, our representations easily surpass more complex and expensive optimization-based schemes, vastly simplifying generalization to new domains.
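
To give a concrete sense of the training-free, metric-based adaptation the abstract refers to, below is a minimal sketch of one such mechanism: a nearest-class-mean (prototype) classifier over frozen embeddings. This is an illustrative assumption, not the paper's implementation; the function names, the use of NumPy, and the random stand-in embeddings are all hypothetical.

```python
import numpy as np

def nearest_class_mean_adapt(support_embs, support_labels, query_embs):
    """Training-free, metric-based few-shot adaptation (illustrative sketch):
    classify queries by cosine similarity to class prototypes, where each
    prototype is the mean of that class's support-set embeddings."""
    # L2-normalize so dot products are cosine similarities.
    def l2norm(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    support_embs = l2norm(support_embs)
    query_embs = l2norm(query_embs)

    # One prototype per class: mean of its (few) support embeddings.
    classes = np.unique(support_labels)
    prototypes = l2norm(np.stack(
        [support_embs[support_labels == c].mean(axis=0) for c in classes]))

    # Assign each query to the most similar prototype; no gradients, no tuning.
    sims = query_embs @ prototypes.T  # shape: (n_query, n_classes)
    return classes[sims.argmax(axis=1)]

# Example: 2 classes with 4 support shots each and 3 queries, using random
# vectors as stand-ins for frozen vision-language embeddings.
rng = np.random.default_rng(0)
support = rng.normal(size=(8, 512))
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
queries = rng.normal(size=(3, 512))
print(nearest_class_mean_adapt(support, labels, queries))
```

Because the adapter only averages and compares embeddings, its cost is negligible next to optimization-based schemes such as fine-tuning, which is the trade-off the abstract highlights.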

Karsten Roth, Zeynep Akata, Dima Damen, Ivana Balažević, Olivier J. Hénaff • 2024

Related benchmarks

Task                  Dataset                Metric    Result  Rank
Image Classification  Stanford Cars (val)    Accuracy  92.8    56
Image Classification  ImageNet-1k (val)      Accuracy  77.9    20
Image Classification  Food-101 (val)         Accuracy  92.6    13
Image Classification  DTD (val)              Accuracy  76.7    13
Classification        Oxford-IIIT Pet (val)  Accuracy  94.4    7
