CALIP: Zero-Shot Enhancement of CLIP with Parameter-free Attention
About
Contrastive Language-Image Pre-training (CLIP) has been shown to learn visual representations with great transferability, achieving promising accuracy for zero-shot classification. To further improve its downstream performance, existing works propose additional learnable modules on top of CLIP and fine-tune them on few-shot training sets. However, the resulting extra training cost and data requirements severely hinder efficient model deployment and knowledge transfer. In this paper, we introduce a free-lunch enhancement method, CALIP, to boost CLIP's zero-shot performance via a parameter-free attention module. Specifically, we guide visual and textual representations to interact with each other and explore cross-modal informative features via attention. As the pre-training has largely reduced the embedding distances between the two modalities, we discard all learnable parameters in the attention and bidirectionally update the multi-modal features, making the whole process parameter-free and training-free. In this way, the images are blended with textual-aware signals and the text representations become visual-guided for better adaptive zero-shot alignment. We evaluate CALIP on various benchmarks spanning 14 datasets for both 2D image and 3D point cloud few-shot classification, showing consistent zero-shot performance improvement over CLIP. On top of that, we further insert a small number of linear layers into CALIP's attention module and verify its robustness under few-shot settings, where it also achieves leading performance compared to existing methods. These extensive experiments demonstrate the superiority of our approach for efficient enhancement of CLIP.
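The parameter-free attention described above can be sketched as a bidirectional cross-modal update with no learnable weights: visual tokens attend to text embeddings and vice versa, using only their similarity. The sketch below is a minimal NumPy illustration under assumed conventions (function and variable names, the residual blending weights `alpha`/`beta`, and the softmax temperature are all hypothetical, not taken from the paper):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def parameter_free_attention(f_v, f_t, alpha=1.0, beta=1.0, temp=1.0):
    """Bidirectional cross-modal update without learnable parameters.

    f_v: (M, D) spatial visual features, assumed L2-normalized (CLIP-style).
    f_t: (K, D) textual class embeddings, assumed L2-normalized.
    Returns text-aware visual features and visual-guided text features.
    """
    # Cross-modal similarity serves directly as attention logits;
    # no learned query/key/value projections are involved.
    attn = f_v @ f_t.T / temp                      # (M, K)
    f_v_new = f_v + alpha * softmax(attn, axis=-1) @ f_t    # blend text into vision
    f_t_new = f_t + beta * softmax(attn.T, axis=-1) @ f_v   # blend vision into text
    return f_v_new, f_t_new
```

Because pre-training has already aligned the two embedding spaces, raw similarity is informative enough to drive the attention, which is why the projections can be dropped; in practice the updated features would be re-normalized before computing the final zero-shot classification logits.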
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Image Classification | ImageNet A | Top-1 Accuracy | 23.96 | 553 |
| Image Classification | ImageNet V2 | Top-1 Accuracy | 53.7 | 487 |
| Image Classification | ImageNet-R | Top-1 Accuracy | 60.81 | 474 |
| Image Classification | ImageNet-Sketch | Top-1 Accuracy | 35.61 | 360 |
| Image Classification | ImageNet (test) | Top-1 Accuracy | 65.81 | 291 |
| Image Classification | 11 Downstream Classification Datasets (ImageNet, Flowers102, DTD, OxfordPets, StanfordCars, UCF101, Caltech101, Food101, SUN397, FGVC-Aircraft, EuroSAT) standard (test) | DTD Accuracy | 42.39 | 115 |
| Image Classification | ImageNet V2 (Target) | Accuracy | 53.7 | 42 |
| Image Classification | ImageNet-Sketch (Target) | Accuracy | 35.61 | 30 |
| Image Classification | ImageNet (source) | Accuracy | 60.57 | 23 |