
Better Language Models Exhibit Higher Visual Alignment

About

How well do text-only large language models (LLMs) align with the visual world? We present a systematic evaluation of this question by incorporating frozen representations of various language models into a discriminative vision-language framework and measuring zero-shot generalization to novel concepts. We find that decoder-based models exhibit stronger visual alignment than encoders, even when controlling for model and dataset size. Moreover, language modeling performance correlates with visual generalization, suggesting that advances in unimodal LLMs can simultaneously improve vision models. Leveraging these insights, we propose ShareLock, a lightweight method for fusing frozen vision and language backbones. ShareLock achieves robust performance across tasks while drastically reducing the need for paired data and compute. With just 563k image-caption pairs and under one GPU-hour of training, it reaches 51% accuracy on ImageNet. In cross-lingual settings, ShareLock dramatically outperforms CLIP, achieving 38.7% top-1 accuracy on Chinese image classification versus CLIP's 1.4%. Code is available.
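The core idea above — keeping both backbones frozen and learning only a lightweight fusion on top, then classifying zero-shot by comparing image and class-prompt embeddings in a shared space — can be sketched as follows. This is a minimal illustration, not the paper's actual architecture: the random features stand in for frozen vision-encoder and LLM outputs, and the projection matrices `W_img`, `W_txt` and all dimensions are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Dimensions are illustrative; in ShareLock-style setups the features would
# come from a frozen pretrained vision encoder and a frozen pretrained LLM.
D_IMG, D_TXT, D_SHARED = 512, 768, 256

# The only trainable parameters: small projection heads into a shared space
# (here randomly initialized; training them contrastively on image-caption
# pairs is what the limited paired data would be used for).
W_img = rng.normal(0.0, 0.02, size=(D_IMG, D_SHARED))
W_txt = rng.normal(0.0, 0.02, size=(D_TXT, D_SHARED))

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Normalize rows to unit length for cosine similarity."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def zero_shot_classify(image_feat: np.ndarray, class_text_feats: np.ndarray) -> int:
    """Predict the class whose text embedding is most similar to the image."""
    img = l2_normalize(image_feat @ W_img)        # (D_SHARED,)
    txt = l2_normalize(class_text_feats @ W_txt)  # (n_classes, D_SHARED)
    return int(np.argmax(txt @ img))              # highest cosine similarity

# Stand-ins for one image's frozen features and 10 class-prompt features.
image_feat = rng.normal(size=D_IMG)
class_feats = rng.normal(size=(10, D_TXT))
pred = zero_shot_classify(image_feat, class_feats)
```

Because the backbones never receive gradients, only the two small projections are trained, which is what makes the sub-GPU-hour training budget plausible.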

Jona Ruthardt, Gertjan J. Burghouts, Serge Belongie, Yuki M. Asano • 2024

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-1K | Top-1 Acc | 59.1 | 836 |
| Text-to-Image Retrieval | Flickr30K | R@1 | 38.5 | 460 |
| Image-to-Text Retrieval | Flickr30K | R@1 | 54.8 | 379 |
| Image-to-Text Retrieval | MSCOCO | R@1 | 30 | 124 |
| Text-to-Image Retrieval | MSCOCO | R@1 | 16.5 | 118 |
| Compositional Reasoning | Winoground | Txt2Img Score | 26.3 | 21 |
| Image Classification | CLIP Zero-shot Evaluation Suite (10 datasets) | Cars Accuracy | 13.2 | 16 |
