Contrastive Vision-Language Pre-training with Limited Resources

About

Pioneering dual-encoder pre-training works (e.g., CLIP and ALIGN) have revealed the potential of aligning multi-modal representations with contrastive learning. However, these works require a tremendous amount of data and computational resources (e.g., billion-level web data and hundreds of GPUs), which prevents researchers with limited resources from reproducing and further exploring them. To this end, we propose a stack of novel methods that significantly cut down this heavy resource dependency and allow us to conduct dual-encoder multi-modal representation alignment with limited resources. We also provide a reproducible baseline with competitive results, namely ZeroVL, trained with only 14M image-text pairs from publicly accessible academic datasets and 8 V100 GPUs. Additionally, we collect 100M web data for pre-training and achieve results comparable or superior to state-of-the-art methods, further proving the effectiveness of our methods on large-scale data. We hope that this work will provide useful data points and experience for future research in contrastive vision-language pre-training. Code is available at https://github.com/zerovl/ZeroVL.
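For context, below is a minimal sketch of the symmetric contrastive (InfoNCE) objective that CLIP-style dual encoders optimize: matched image-text pairs in a batch are pulled together while all other pairings serve as negatives. The function name and temperature value here are illustrative, not taken from the ZeroVL codebase.

```python
# Sketch of the symmetric contrastive loss used in dual-encoder
# vision-language pre-training (illustrative, not ZeroVL's exact code).
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize both embedding sets so dot products are cosine similarities.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # Pairwise similarity logits between every image and every text in the batch.
    logits = image_emb @ text_emb.t() / temperature

    # Matched image-text pairs sit on the diagonal of the logits matrix.
    targets = torch.arange(logits.size(0), device=logits.device)

    # Symmetric cross-entropy: image-to-text and text-to-image directions.
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return (loss_i2t + loss_t2i) / 2

# Usage with dummy encoder outputs: batch size 8, embedding dim 512.
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_contrastive_loss(img, txt))
```

Because every in-batch sample acts as a negative for every other, this objective benefits from large batch sizes, which is exactly what makes naive training resource-hungry and what ZeroVL's methods aim to mitigate.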

Quan Cui, Boyan Zhou, Yu Guo, Weidong Yin, Hao Wu, Osamu Yoshie, Yubo Chen • 2021

Related benchmarks

Task                  Dataset              Metric              Result  Rank
Image Copy Detection  GLIDE (test)         Average Similarity  0.707   28
Image Copy Detection  SDXL (test)          Average Similarity  0.677   28
Image Copy Detection  Midjourney (test)    Average Similarity  0.581   28
Image Copy Detection  DeepFloyd IF (test)  Average Similarity  0.681   28
Image Copy Detection  DALL-E 2 (test)      Average Similarity  0.585   28
Image Copy Detection  New Bing (test)      Average Similarity  0.589   28
Image Copy Detection  SD 1.5 (test)        Average Similarity  0.578   28
ICDiff                D-Rep (test)         PCC                 36.3    20
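As a rough illustration of the Average Similarity metric above, the sketch below computes the mean cosine similarity between embeddings of generated images and their matched reference images. This reading of the metric, and the function and variable names, are assumptions for illustration; the benchmark's exact protocol may differ.

```python
# Hypothetical sketch: "Average Similarity" as the mean cosine similarity
# between matched (generated, reference) image embedding pairs.
import torch
import torch.nn.functional as F

def average_similarity(gen_emb, ref_emb):
    # Cosine similarity per matched pair, averaged over the whole set.
    sims = F.cosine_similarity(gen_emb, ref_emb, dim=-1)
    return sims.mean().item()

gen = torch.randn(100, 512)  # embeddings of generated images (dummy data)
ref = torch.randn(100, 512)  # embeddings of matched reference images
print(average_similarity(gen, ref))
```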
