
CLIPS: An Enhanced CLIP Framework for Learning with Synthetic Captions

About

Previous works show that noisy, web-crawled image-text pairs may limit vision-language pretraining approaches such as CLIP, and propose learning with synthetic captions as a promising alternative. Our work continues this effort, introducing two simple yet effective designs to better leverage richly described synthetic captions. First, motivated by a strong inverse effect observed when learning with synthetic captions -- short synthetic captions generally lead to much higher performance than full-length ones -- we feed only partial synthetic captions to the text encoder. Second, we incorporate an autoregressive captioner that mimics the recaptioning process: conditioned on the paired image and its web-crawled text description, the captioner learns to predict the full-length synthetic caption generated by advanced MLLMs. Experiments show that our framework significantly improves zero-shot performance on cross-modal retrieval tasks, setting new state-of-the-art results on MSCOCO and Flickr30K. Moreover, the trained vision encoder can enhance the visual capability of LLaVA, yielding strong improvements across a range of MLLM benchmarks. Our project page is https://ucsc-vlaa.github.io/CLIPS/.
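The first design above, feeding only partial synthetic captions to the text encoder, can be sketched as a token-level subsampling step. This is a minimal illustrative sketch, not the paper's implementation: the helper name `sample_partial_caption`, the `ratio` parameter, and the random contiguous-span strategy are assumptions for demonstration.

```python
import random

def sample_partial_caption(tokens, ratio=0.25, seed=0):
    """Keep only a contiguous sub-span of a synthetic caption's tokens.

    Hypothetical helper: the span length (`ratio`) and the random
    contiguous-span choice are illustrative assumptions, not the
    paper's exact sub-caption sampling strategy.
    """
    rng = random.Random(seed)
    # choose how many tokens the partial caption keeps (at least one)
    k = max(1, int(len(tokens) * ratio))
    # take a random contiguous sub-span so training sees varied fragments
    start = rng.randint(0, len(tokens) - k)
    return tokens[start:start + k]

# a full-length synthetic caption, as a list of tokens
full_caption = ("a tabby cat sits on a wooden windowsill bathed in warm "
                "afternoon light while curtains sway gently behind it").split()
partial = sample_partial_caption(full_caption, ratio=0.3)
print(len(full_caption), len(partial))
```

Only `partial` would be tokenized and passed to the text encoder during contrastive training; the full-length caption remains the prediction target for the captioner branch.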

Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, Cihang Xie • 2024

Related benchmarks

Task                             Dataset        Result                           Rank
Zero-shot Image Classification   ImageNet-1K    Top-1 Accuracy: 76.8             101
Compositional Reasoning          SugarCrepe     Overall Accuracy: 75.2           50
Zero-shot Image-Text Retrieval   Flickr30K      Accuracy (Zero-shot): 97.7       7
Zero-shot Image-Text Retrieval   DOCCI          Accuracy: 66.2                   7
Zero-shot Image-Text Retrieval   IIW            Zero-shot Accuracy (IIW): 81.8   7
Compositional Reasoning          SugarCrepe++   Replace I2T: 76.6                7
Zero-shot Image-Text Retrieval   MSCOCO         Accuracy: 82.1                   7
