Harnessing Frozen Unimodal Encoders for Flexible Multimodal Alignment

About

Recent contrastive multimodal vision-language models like CLIP have demonstrated robust open-world semantic understanding, becoming the standard image backbones for vision-language applications. However, recent findings suggest high semantic similarity between well-trained unimodal encoders, which raises a key question: Is there a plausible way to connect unimodal backbones for vision-language tasks? To this end, we propose a novel framework that aligns vision and language using frozen unimodal encoders. It involves selecting semantically similar encoders in the latent space, curating a concept-rich dataset of image-caption pairs, and training simple MLP projectors. We evaluated our approach on 12 zero-shot classification datasets and 2 image-text retrieval datasets. Our best model, utilizing DINOv2 and the All-Roberta-Large text encoder, achieves 76% accuracy on ImageNet with a 20-fold reduction in data and a 65-fold reduction in compute requirements compared to multimodal alignment in which models are trained from scratch. The proposed framework enhances the accessibility of multimodal model development while enabling flexible adaptation across diverse scenarios. Code and curated datasets are available at github.com/mayug/freeze-align.
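The pipeline described in the abstract (compare encoders in latent space, then train small MLP projectors on top of frozen features) can be illustrated with a minimal NumPy sketch. Several things here are assumptions rather than details taken from the abstract: linear CKA as the similarity measure, a CLIP-style symmetric contrastive loss with temperature 0.07, the two-layer projector shape, and every function name. Random arrays stand in for the outputs of the frozen image and text encoders, and only the forward pass is shown, not the training loop.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two embedding matrices (one row per paired sample)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro"))


def init_mlp(rng, d_in, d_hidden, d_out):
    """Hypothetical two-layer projector; the real architecture may differ."""
    return {
        "w1": rng.normal(0.0, 1.0 / np.sqrt(d_in), (d_in, d_hidden)),
        "b1": np.zeros(d_hidden),
        "w2": rng.normal(0.0, 1.0 / np.sqrt(d_hidden), (d_hidden, d_out)),
        "b2": np.zeros(d_out),
    }


def mlp_project(params, feats):
    """Map frozen encoder features into the shared space and L2-normalize."""
    hidden = np.maximum(feats @ params["w1"] + params["b1"], 0.0)  # ReLU
    z = hidden @ params["w2"] + params["b2"]
    norms = np.maximum(np.linalg.norm(z, axis=1, keepdims=True), 1e-12)
    return z / norms


def symmetric_contrastive_loss(img_z, txt_z, temperature=0.07):
    """Average of image-to-text and text-to-image cross-entropy (assumed loss)."""
    logits = img_z @ txt_z.T / temperature

    def diagonal_cross_entropy(scores):
        scores = scores - scores.max(axis=1, keepdims=True)  # numerical stability
        log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_probs))

    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


def zero_shot_predict(img_z, class_txt_z):
    """Pick the class whose projected text embedding is most similar."""
    return np.argmax(img_z @ class_txt_z.T, axis=1)


# Random stand-ins for frozen encoder outputs on 8 image-caption pairs.
rng = np.random.default_rng(0)
image_feats = rng.normal(size=(8, 32))
text_feats = rng.normal(size=(8, 24))

similarity = linear_cka(image_feats, text_feats)
img_z = mlp_project(init_mlp(rng, 32, 64, 16), image_feats)
txt_z = mlp_project(init_mlp(rng, 24, 64, 16), text_feats)
loss = symmetric_contrastive_loss(img_z, txt_z)
print("CKA:", similarity, "loss:", loss)
```

In a setup like this only the projector weights would receive gradients; the encoders stay frozen, so their features can be computed once and cached, which is one plausible reading of where the reported compute savings come from.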

Mayug Maniparambil, Raiymbek Akshulakov, Yasser Abdelaziz Dahou Djilali, Sanath Narayan, Ankit Singh, Noel E. O'Connor • 2024

Related benchmarks

Task                      Dataset            Metric      Result   Rank
Image Classification      ImageNet V2        --          --       749
Image Classification      UCF101             Top-1 Acc   73.2     527
Text-to-Image Retrieval   Flickr30k (test)   Recall@1    74.1     525
Classification            Cars               Accuracy    73.9     492
Image-to-Text Retrieval   Flickr30k (test)   R@1         87.5     472
Image Classification      ImageNet           --          --       431
Image Classification      CUB                Accuracy    66.1     331
Semantic segmentation     Pascal Context     mIoU        24.61    217
Image Classification      Food               Accuracy    89.1     152
Image Classification      Caltech            Accuracy    92.8     129
Showing 10 of 30 rows
