VQGAN-CLIP: Open Domain Image Generation and Editing with Natural Language Guidance

About

Generating and editing images from open domain text prompts is a challenging task that heretofore has required expensive and specially trained models. We demonstrate a novel methodology for both tasks which is capable of producing images of high visual quality from text prompts of significant semantic complexity without any training by using a multimodal encoder to guide image generations. We demonstrate on a variety of tasks how using CLIP [37] to guide VQGAN [11] produces higher visual quality outputs than prior, less flexible approaches like DALL-E [38], GLIDE [33] and Open-Edit [24], despite not being trained for the tasks presented. Our code is available in a public repository.
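The core idea in the abstract, steering a generator with a multimodal encoder instead of training it, can be illustrated with a toy sketch. This is not the authors' code. `toy_generator` and `toy_image_encoder` are hypothetical random linear maps standing in for the VQGAN decoder and the CLIP image encoder, `text_embedding` is a made-up vector standing in for an encoded prompt, and the gradient is estimated by finite differences rather than automatic differentiation. Only the latent `z` is updated; the stand-in models stay frozen.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Cosine similarity, with a small constant guarding against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)


def random_matrix(rows, cols, rng):
    """Random matrix used as a frozen stand-in model (assumption, not a real network)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def guidance_loss(z, toy_generator, toy_image_encoder, text_embedding):
    """1 - cosine similarity between the 'image' embedding and the 'text' embedding."""
    image = matvec(toy_generator, z)                    # stand-in for decoding a latent
    image_embedding = matvec(toy_image_encoder, image)  # stand-in for encoding the image
    return 1.0 - cosine(image_embedding, text_embedding)


def optimize_latent(z, toy_generator, toy_image_encoder, text_embedding,
                    steps=200, lr=0.5, eps=1e-5):
    """Update only the latent z; a step is kept only if it lowers the loss."""
    z = list(z)
    loss = guidance_loss(z, toy_generator, toy_image_encoder, text_embedding)
    history = [loss]
    for _ in range(steps):
        # Forward-difference gradient estimate with respect to each latent entry.
        grad = []
        for i in range(len(z)):
            bumped = list(z)
            bumped[i] += eps
            bumped_loss = guidance_loss(bumped, toy_generator,
                                        toy_image_encoder, text_embedding)
            grad.append((bumped_loss - loss) / eps)
        candidate = [zi - lr * gi for zi, gi in zip(z, grad)]
        candidate_loss = guidance_loss(candidate, toy_generator,
                                       toy_image_encoder, text_embedding)
        if candidate_loss < loss:
            z, loss = candidate, candidate_loss
        else:
            lr *= 0.5  # step was too large; shrink it and retry next iteration
        history.append(loss)
    return z, history


rng = random.Random(0)
toy_generator = random_matrix(6, 4, rng)       # latent (4) -> "image" (6)
toy_image_encoder = random_matrix(3, 6, rng)   # "image" (6) -> embedding (3)
text_embedding = [rng.uniform(-1.0, 1.0) for _ in range(3)]
z0 = [rng.uniform(-1.0, 1.0) for _ in range(4)]

z_final, history = optimize_latent(z0, toy_generator, toy_image_encoder, text_embedding)
print(f"loss: {history[0]:.4f} -> {history[-1]:.4f}")
```

In a real setting the two linear maps would be replaced by pretrained networks and the finite-difference loop by backpropagation. The sketch shows only the structure of the guidance loop and should not be read as a description of the paper's full pipeline.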

Katherine Crowson, Stella Biderman, Daniel Kornis, Dashiell Stander, Eric Hallahan, Louis Castricato, Edward Raff • 2022

Related benchmarks

Task	Dataset	Result
Longitudinal Brain MRI Synthesis	ADNI (test)	SSIM 0.7463	13
Target (Aircraft) Classification	Boeing simulated	Precision 84.69	10
Azimuth Angle Classification	Boeing simulated	Precision 4.84	10
Depression Angle Classification	Boeing simulated	Precision 0.1424	10
Polarization Mode Classification	Shanxi real-world (test)	Precision 75.05	10
Azimuth Angle Classification	Shanxi real-world (test)	Precision 1.39	10
Target (Aircraft) Classification	Shanxi real-world (test)	Precision 92.16	10
SAR Image Generation	Shanxi dataset (test)	PSNR 23.87	9
SAR Image Generation	Boeing (test)	PSNR 29.7	9
Longitudinal Brain MRI Synthesis	Brain MRI 0 ≤ Δt < 12 (test)	SSIM 0.7553	7

Showing 10 of 14 rows
