Scaling Language-Image Pre-training via Masking

About

We present Fast Language-Image Pre-training (FLIP), a simple and more efficient method for training CLIP. Our method randomly masks out and removes a large portion of image patches during training. Masking allows us to learn from more image-text pairs given the same wall-clock time and to contrast more samples per iteration with a similar memory footprint. It leads to a favorable trade-off between accuracy and training time. In our experiments on 400 million image-text pairs, FLIP improves both accuracy and speed over the no-masking baseline. Across a wide range of downstream tasks, FLIP dominantly outperforms the CLIP counterparts trained on the same data. Facilitated by the speedup, we explore the scaling behavior of increasing the model size, data size, or training length, and report encouraging results and comparisons. We hope that our work will foster future research on scaling vision-language learning.
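The masking step described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `random_mask_patches`, the 196-patch grid, and the 50% mask ratio are assumptions chosen for illustration, and patches are stood in for by strings rather than embedding vectors. The point it shows is that masked patches are dropped entirely rather than replaced, so the image encoder processes a shorter sequence.

```python
import random


def random_mask_patches(patches, mask_ratio, rng):
    """Keep a random subset of patches; masked patches are removed, not replaced.

    Hypothetical helper for illustration only.
    """
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")
    # Number of patches that survive masking (at least one).
    num_keep = max(1, round(len(patches) * (1.0 - mask_ratio)))
    # Sample indices without replacement, then sort to preserve patch order.
    keep_idx = sorted(rng.sample(range(len(patches)), num_keep))
    return [patches[i] for i in keep_idx], keep_idx


rng = random.Random(0)
# Assumed setup: a 14 x 14 grid of patches (e.g. a 224 px image with 16 px patches).
patches = [f"patch_{i}" for i in range(196)]
kept, keep_idx = random_mask_patches(patches, mask_ratio=0.5, rng=rng)
print(len(kept))  # 98
```

With half the patches removed, each image contributes roughly half as many tokens to the encoder, which is why more image-text pairs can be processed in the same time and more samples contrasted per iteration at a similar memory footprint.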

Yanghao Li, Haoqi Fan, Ronghang Hu, Christoph Feichtenhofer, Kaiming He • 2022

Related benchmarks

Task	Dataset	Metric	Result
Object Hallucination Evaluation	POPE	Accuracy	86.5	2019
Visual Question Answering	GQA	Accuracy	60.9	1425
Image Classification	ImageNet-1K	Top-1 Acc	61.3	1239
Image Classification	ImageNet 1k (test)	Top-1 Accuracy	86.9	880
Multimodal Understanding	MMBench	--	--	847
Image Classification	ImageNet V2	Top-1 Acc	66.8	749
Multimodal Evaluation	MME	--	--	727
Visual Question Answering	VQA v2 (test-dev)	Overall Accuracy	74.7	712
Image Classification	ImageNet A	Top-1 Acc	71.9	698
Image Classification	Stanford Cars	Accuracy	90.9	660

Showing 10 of 119 rows
