SuperCLIP: CLIP with Simple Classification Supervision
About
Contrastive Language-Image Pretraining (CLIP) achieves strong generalization in vision-language tasks by aligning images and texts in a shared embedding space. However, recent findings show that CLIP-like models still underutilize fine-grained semantic signals in text, and the issue becomes more pronounced with long, detailed captions. The cause is CLIP's training objective, which optimizes only global image-text similarity and provides no token-level supervision, limiting the model's ability to achieve fine-grained visual-text alignment. To address this, we propose SuperCLIP, a simple yet effective framework that augments contrastive learning with classification-based supervision. By adding only a lightweight linear layer to the vision encoder, SuperCLIP leverages token-level cues to enhance visual-textual alignment, with just a 0.077% increase in total FLOPs and no need for additional annotated data. Experiments show that SuperCLIP consistently improves zero-shot classification, image-text retrieval, and purely visual tasks. These gains hold whether the model is trained on original web data or on rich re-captioned data, demonstrating SuperCLIP's ability to recover textual supervision in both cases. Furthermore, because the classification-based supervision does not rely on large batch sizes, SuperCLIP alleviates CLIP's performance drop at small batch sizes. Code and models will be made open source.
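To make the idea of classification-based supervision concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the classification targets, the loss, or how the two objectives are weighted, so the sketch assumes that each caption is turned into a normalized bag-of-tokens target over the tokenizer vocabulary, that the linear head is trained with softmax cross-entropy against that target, and that the result is added to the contrastive loss with a weight `cls_weight`. All function and parameter names are hypothetical.

```python
import math
from typing import List


def multi_hot_targets(token_ids: List[int], vocab_size: int) -> List[float]:
    """Turn a caption's token IDs into a normalized bag-of-tokens target (assumed form)."""
    if not token_ids:
        raise ValueError("caption must contain at least one token")
    target = [0.0] * vocab_size
    for t in set(token_ids):  # each distinct token counts once
        target[t] = 1.0
    total = sum(target)
    return [v / total for v in target]


def linear_logits(feature: List[float], weight: List[List[float]], bias: List[float]) -> List[float]:
    """Lightweight linear head on the image feature: one logit per vocabulary token."""
    return [sum(w * x for w, x in zip(row, feature)) + b for row, b in zip(weight, bias)]


def log_softmax(logits: List[float]) -> List[float]:
    """Numerically stable log-softmax."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def classification_loss(logits: List[float], target: List[float]) -> float:
    """Softmax cross-entropy against a soft target (an assumption, not confirmed by the abstract)."""
    return -sum(t * lp for t, lp in zip(target, log_softmax(logits)))


def total_loss(contrastive_loss: float, cls_loss: float, cls_weight: float = 1.0) -> float:
    """Combine the usual CLIP contrastive loss with the classification term."""
    return contrastive_loss + cls_weight * cls_loss


# Toy example: vocabulary of 4 tokens, 2-dimensional image feature, zero-initialized head.
vocab_size = 4
target = multi_hot_targets([1, 3, 3], vocab_size)
logits = linear_logits([0.5, -0.2], [[0.0, 0.0]] * vocab_size, [0.0] * vocab_size)
loss = classification_loss(logits, target)
```

Note that the classification term is computed per image from its own caption and needs no in-batch negatives, which is consistent with the abstract's claim that this form of supervision does not depend on large batch sizes.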
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Visual Question Answering | VQA v2 | Accuracy 69.6 | 1165 |
| Visual Question Answering | VizWiz | Accuracy 44.4 | 1043 |
| Semantic segmentation | ADE20K | mIoU 36.3 | 936 |
| Object Hallucination Evaluation | POPE | Accuracy 82 | 935 |
| Multimodal Evaluation | MME | Score 1.56e+3 | 557 |
| Image Classification | ImageNet-1k (val) | -- | 512 |
| Text-based Visual Question Answering | TextVQA | Accuracy 48.4 | 496 |
| Visual Question Answering | GQA | Accuracy 57.5 | 374 |
| Science Question Answering | ScienceQA | Accuracy 69.1 | 229 |
| Image Classification | ImageNet-1K | Accuracy 81 | 190 |