
StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

About

We investigate the potential of learning visual representations from synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models at generating high-quality images. We specifically consider Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat training on the real-image counterparts; (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass those learned by SimCLR and CLIP using the same set of text prompts and the corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
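The multi-positive idea can be sketched as a cross-entropy between a softmax over cosine similarities and a ground-truth distribution that is uniform over all other images generated from the same prompt. The sketch below is a minimal NumPy illustration of that loss structure, not the authors' implementation; the function name, the temperature value, and the assumption that every caption has at least two generated images are ours.

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, caption_ids, tau=0.1):
    """Sketch of a multi-positive contrastive loss: images generated
    from the same text prompt (same caption id) are positives for
    each other. Assumes each caption id appears at least twice."""
    # L2-normalize embeddings so the dot product is cosine similarity.
    z = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    logits = z @ z.T / tau
    # Exclude self-similarity from both logits and targets.
    np.fill_diagonal(logits, -np.inf)
    same = np.array(caption_ids)[:, None] == np.array(caption_ids)[None, :]
    np.fill_diagonal(same, False)
    # Ground-truth distribution: uniform over the other positives.
    target = same / same.sum(axis=1, keepdims=True)
    # Log-softmax over each row of similarities.
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy; the where() avoids 0 * (-inf) on non-positive pairs.
    per_sample = -(target * np.where(same, log_prob, 0.0)).sum(axis=1)
    return per_sample.mean()
```

With a standard single-positive InfoNCE loss, only one generated image per caption could serve as the positive; spreading the target mass uniformly over all same-caption images is what distinguishes the multi-positive formulation.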

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan • 2023

Related benchmarks

Task                  | Dataset           | Metric         | Result | Rank
----------------------|-------------------|----------------|--------|-----
Image Classification  | ImageNet-1k (val) | Top-1 Accuracy | 74.5   | 1453
Semantic Segmentation | ADE20K            | mIoU           | 49.4   | 936
Image Classification  | CIFAR-100         | --             | --     | 622
Image Classification  | Food-101          | Accuracy       | 91.8   | 494
Image Classification  | DTD               | Accuracy       | 86.4   | 419
Image Classification  | Cars              | Accuracy       | 91.8   | 314
Image Classification  | Aircraft          | Accuracy       | 62.6   | 302
Image Classification  | SUN397            | Accuracy       | 97.3   | 246
Image Classification  | Caltech-101       | Accuracy       | 98.9   | 198
Image Classification  | CIFAR-10          | Accuracy       | 92.7   | 101
Showing 10 of 27 rows

Other info

Code
