Efficient Vision-Language Pre-training by Cluster Masking

About

We propose a simple strategy for masking image patches during visual-language contrastive learning that improves the quality of the learned representations and the training speed. During each iteration of training, we randomly mask clusters of visually similar image patches, as measured by their raw pixel intensities. This provides an extra learning signal, beyond the contrastive training itself, since it forces a model to predict words for masked visual structures solely from context. It also speeds up training by reducing the amount of data used in each image. We evaluate the effectiveness of our model by pre-training on a number of benchmarks, finding that it outperforms other masking strategies, such as FLIP, on the quality of the learned representation.
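The masking step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract only states that clusters of visually similar patches are masked based on raw pixel intensities. The function names (`patchify`, `cluster_mask`), the random anchor selection, the cosine similarity on mean-centered pixels, and the parameters `patch_size`, `anchor_ratio`, and `sim_threshold` are all assumptions made for illustration.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened, non-overlapping patches."""
    h, w, c = image.shape
    gh, gw = h // patch_size, w // patch_size
    # Crop so that height and width are multiples of the patch size.
    image = image[: gh * patch_size, : gw * patch_size]
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)
    return patches.reshape(gh * gw, patch_size * patch_size * c)


def cluster_mask(image, patch_size=16, anchor_ratio=0.05,
                 sim_threshold=0.75, rng=None):
    """Return a boolean mask over patches; True means the patch is masked.

    Hypothetical procedure: pick random anchor patches, then mask every
    patch whose raw pixels are similar enough to at least one anchor.
    """
    rng = np.random.default_rng() if rng is None else rng
    patches = patchify(image.astype(np.float64), patch_size)

    # Mean-center and L2-normalize raw pixels so that the dot product
    # below is a cosine similarity (assumed similarity measure).
    patches = patches - patches.mean(axis=1, keepdims=True)
    norms = np.linalg.norm(patches, axis=1, keepdims=True)
    patches = patches / np.maximum(norms, 1e-8)

    n = patches.shape[0]
    num_anchors = max(1, int(round(anchor_ratio * n)))
    anchors = rng.choice(n, size=num_anchors, replace=False)

    # Similarity of every anchor to every patch: (num_anchors, n).
    sim = patches[anchors] @ patches.T
    mask = (sim >= sim_threshold).any(axis=0)
    mask[anchors] = True  # an anchor always belongs to its own cluster
    return mask


# Usage: mask a random 224x224 RGB image and keep the visible patches.
rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
mask = cluster_mask(image, rng=rng)
visible_idx = np.flatnonzero(~mask)
print(f"masked {mask.sum()} of {mask.size} patches")
```

In a setup like this, only the patches indexed by `visible_idx` would be passed to the image encoder, which is where the reduction in per-image data, and hence the training speed-up, would come from.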

Zihao Wei, Zixuan Pan, Andrew Owens • 2024

Related benchmarks

Task	Dataset	Result
Image Classification	ImageNet-1K	Top-1 Acc 62.7	1239
Image Classification	DTD	--	610
Text-to-Image Retrieval	Flickr30K	R@1 57.6	607
Classification	Cars	Accuracy 15.1	571
Image Classification	EuroSAT	--	569
Image Classification	RESISC45	--	539
Image Classification	CIFAR-10	Accuracy 89	507
Image-to-Text Retrieval	Flickr30K	R@1 43.3	451
Image Classification	SUN397	--	425
Image Classification	GTSRB	Accuracy 9.6	291

