
Make-A-Scene: Scene-Based Text-to-Image Generation with Human Priors

About

Recent text-to-image generation methods provide a simple yet exciting conversion capability between text and image domains. While these methods have incrementally improved the generated image fidelity and text relevancy, several pivotal gaps remain unaddressed, limiting applicability and quality. We propose a novel text-to-image method that addresses these gaps by (i) enabling a simple control mechanism complementary to text in the form of a scene, (ii) introducing elements that substantially improve the tokenization process by employing domain-specific knowledge over key image regions (faces and salient objects), and (iii) adapting classifier-free guidance for the transformer use case. Our model achieves state-of-the-art FID and human evaluation results, unlocking the ability to generate high-fidelity images at a resolution of 512x512 pixels, significantly improving visual quality. Through scene controllability, we introduce several new capabilities: (i) scene editing, (ii) text editing with anchor scenes, (iii) overcoming out-of-distribution text prompts, and (iv) story illustration generation, as demonstrated in the story we wrote.
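
Point (iii) above can be illustrated with a minimal sketch of classifier-free guidance applied to next-token logits. This uses the commonly cited formulation (extrapolating from unconditional logits toward text-conditioned logits); the abstract does not give the exact formula, so treat it as an assumption. The names `guided_logits`, `softmax`, and `scale` are hypothetical, and the toy logit values are made up for illustration.

```python
import math


def guided_logits(cond_logits, uncond_logits, scale):
    """Blend conditional and unconditional logits for one decoding step.

    Assumed formulation: uncond + scale * (cond - uncond).
    scale = 0 gives the unconditional logits, scale = 1 the conditional
    logits, and scale > 1 pushes further toward the text condition.
    """
    if len(cond_logits) != len(uncond_logits):
        raise ValueError("logit vectors must have the same length")
    return [u + scale * (c - u) for c, u in zip(cond_logits, uncond_logits)]


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy vocabulary of three image tokens (illustrative values only).
cond = [2.0, 1.0, 0.0]     # logits given the text prompt
uncond = [1.0, 1.0, 1.0]   # logits given an empty prompt

probs_plain = softmax(guided_logits(cond, uncond, 1.0))
probs_guided = softmax(guided_logits(cond, uncond, 3.0))
```

In this toy case, raising `scale` from 1.0 to 3.0 concentrates more probability on the token that the text condition favors, which is the intended effect of guidance during autoregressive sampling.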

Oran Gafni, Adam Polyak, Oron Ashual, Shelly Sheynin, Devi Parikh, Yaniv Taigman • 2022

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Image Generation | MS-COCO 2014 (val) | FID 2.47 | 128
Text-to-Image Generation | MS-COCO (val) | FID 7.55 | 112
Text-to-Image Generation | MS-COCO | FID 11.8 | 75
Text-to-Image Synthesis | MS-COCO 2014 (val) | FID 11.84 | 58
Text-to-Image Generation | MS-COCO 256x256 (val) | FID 7.55 | 53
Text-to-Image Generation | COCO 30k subset 2014 (val) | FID 2.47 | 46
Text-to-Image Generation | MS COCO zero-shot | FID 11.84 | 42
Text-to-Image Generation | COCO 256 x 256 2014 (val) | FID 7.55 | 37
Text-to-Image Synthesis | MSCOCO | FID 7.55 | 31
Grounded Text-to-Image Generation | COCO 2014 (val) | FID 7.55 | 26
Showing 10 of 20 rows
