LiT: Zero-Shot Transfer with Locked-image text Tuning
About
This paper presents contrastive-tuning, a simple method employing contrastive training to align image and text models while still taking advantage of their pre-training. In our empirical study we find that locked pre-trained image models with unlocked text models work best. We call this instance of contrastive-tuning "Locked-image Tuning" (LiT), which just teaches a text model to read out good representations from a pre-trained image model for new tasks. A LiT model gains the capability of zero-shot transfer to new vision tasks, such as image classification or retrieval. The proposed LiT is widely applicable; it works reliably with multiple pre-training methods (supervised and unsupervised) and across diverse architectures (ResNet, Vision Transformers and MLP-Mixer) using three different image-text datasets. With the transformer-based pre-trained ViT-g/14 model, the LiT model achieves 85.2% zero-shot transfer accuracy on the ImageNet test set, and 82.5% on the challenging out-of-distribution ObjectNet test set.
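The contrastive-tuning recipe described above can be sketched as a CLIP-style symmetric contrastive loss in which the matched image-text pairs of a batch are positives and all other pairings are negatives; in LiT the image embeddings come from a locked pre-trained tower, so only the text tower is trained to align with the frozen image space. The function names and NumPy setup below are illustrative, not the paper's code:

```python
import numpy as np

def log_softmax(x, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))

def lit_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric (image-to-text and text-to-image) contrastive loss.

    In LiT, `image_emb` comes from the *locked* pre-trained image tower
    (no gradients flow into it); only the text tower producing
    `text_emb` is updated so its outputs align with the frozen image space.
    """
    # L2-normalize both towers' embeddings so logits are cosine similarities.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature
    # Matched pairs sit on the diagonal; every other in-batch pairing
    # acts as a negative.
    i2t = -np.diag(log_softmax(logits, axis=1)).mean()
    t2i = -np.diag(log_softmax(logits, axis=0)).mean()
    return 0.5 * (i2t + t2i)
```

A perfectly aligned batch (text embeddings identical to the image embeddings) yields a lower loss than a mismatched batch, which is the signal that drives the text tower toward the locked image representations.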
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet-A | Top-1 Acc | 81.8 | 553 |
| Image Classification | ImageNet V2 | Top-1 Acc | 79.8 | 487 |
| Image Classification | ImageNet-R | Top-1 Acc | 94.9 | 474 |
| Image-to-Text Retrieval | Flickr30K 1K (test) | R@1 | 83.9 | 439 |
| Image Classification | ImageNet | -- | -- | 429 |
| Text-to-Image Retrieval | Flickr30k (test) | R@1 | 66.5 | 423 |
| Image Classification | UCF101 | Top-1 Acc | 60 | 404 |
| Text-to-Image Retrieval | Flickr30K 1K (test) | R@1 | 66.5 | 375 |
| Image-to-Text Retrieval | Flickr30k (test) | R@1 | 83.9 | 370 |
| Image Classification | ImageNet | Top-1 Acc | 85.2 | 324 |