Contrastive Vision-Language Pre-training with Limited Resources

About

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing and further exploring them. To this end, we propose a stack of novel methods that significantly cut down this heavy resource dependency and allow us to conduct dual-encoder multi-modal representation alignment with limited resources. We also provide a reproducible baseline with competitive results, namely ZeroVL, trained with only 14M image-text pairs from publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training and achieve results comparable or superior to state-of-the-art methods, further proving the effectiveness of our methods on large-scale data. We hope that this work will provide useful data points and experience for future research in contrastive vision-language pre-training. Code is available at https://github.com/zerovl/ZeroVL.
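For context, below is a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style dual encoders optimize: matched image-text pairs in a batch are pulled together while all other pairings serve as negatives. The function name and temperature value here are illustrative, not taken from the ZeroVL codebase.

```python
# Sketch of the symmetric contrastive loss used in dual-encoder
# vision-language pre-training (illustrative, not ZeroVL's exact code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize both embedding sets so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs sit on the diagonal of the logits matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with dummy encoder outputs: batch size 8, embedding dim 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```

Because every in-batch sample acts as a negative for every other, this objective benefits from large batch sizes, which is exactly what makes naive training resource-hungry and what ZeroVL's methods aim to mitigate.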

Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie, Yubo Chen • 2021

Related benchmarks

Task                  Dataset              Metric              Result  Rank
Image Copy Detection  GLIDE (test)         Average Similarity  0.707   28
Image Copy Detection  SDXL (test)          Average Similarity  0.677   28
Image Copy Detection  Midjourney (test)    Average Similarity  0.581   28
Image Copy Detection  DeepFloyd IF (test)  Average Similarity  0.681   28
Image Copy Detection  DALL-E 2 (test)      Average Similarity  0.585   28
Image Copy Detection  New Bing (test)      Average Similarity  0.589   28
Image Copy Detection  SD 1.5 (test)        Average Similarity  0.578   28
ICDiff                D-Rep (test)         PCC                 36.3    20
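As a rough illustration of the Average Similarity metric above, the sketch below computes the mean cosine similarity between embeddings of generated images and their matched reference images. This reading of the metric, and the function and variable names, are assumptions for illustration; the benchmark's exact protocol may differ.

```python
# Hypothetical sketch: "Average Similarity" as the mean cosine similarity
# between matched (generated, reference) image embedding pairs.
import torch
import torch.nn.functional as F

def average_similarity(gen_emb, ref_emb):
    # Cosine similarity per matched pair, averaged over the whole set.
    sims = F.cosine_similarity(gen_emb, ref_emb, dim=-1)
    return sims.mean().item()

gen = torch.randn(100, 512)  # embeddings of generated images (dummy data)
ref = torch.randn(100, 512)  # embeddings of matched reference images
print(average_similarity(gen, ref))
```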
