Muse: Text-To-Image Generation via Masked Generative Transformers

About

We present Muse, a text-to-image Transformer model that achieves state-of-the-art image generation performance while being significantly more efficient than diffusion or autoregressive models. Muse is trained on a masked modeling task in discrete token space: given the text embedding extracted from a pre-trained large language model (LLM), Muse is trained to predict randomly masked image tokens. Compared to pixel-space diffusion models, such as Imagen and DALL-E 2, Muse is significantly more efficient because it uses discrete tokens and requires fewer sampling iterations; compared to autoregressive models, such as Parti, Muse is more efficient because it uses parallel decoding. The use of a pre-trained LLM enables fine-grained language understanding, which translates to high-fidelity image generation and an understanding of visual concepts such as objects, their spatial relationships, pose, and cardinality. Our 900M parameter model achieves a new state of the art (SOTA) on CC3M, with an FID score of 6.06. The Muse 3B parameter model achieves an FID of 7.88 on zero-shot COCO evaluation, along with a CLIP score of 0.32. Muse also directly enables a number of image editing applications without the need to fine-tune or invert the model: inpainting, outpainting, and mask-free editing. More results are available at https://muse-model.github.io
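The masked modeling task described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the sentinel `MASK_ID`, the helper `mask_tokens`, the codebook size of 8192, the 16-token sequence, and the fixed 50% masking ratio are all assumptions chosen for illustration. The sketch only shows how a random subset of discrete image tokens might be hidden and which positions would serve as prediction targets.

```python
import random

MASK_ID = -1  # hypothetical sentinel marking a masked position


def mask_tokens(tokens, mask_ratio, rng):
    """Replace a random subset of tokens with MASK_ID.

    Returns the masked sequence and the sorted list of masked positions.
    """
    n_mask = max(1, round(mask_ratio * len(tokens)))
    positions = sorted(rng.sample(range(len(tokens)), n_mask))
    masked = list(tokens)
    for p in positions:
        masked[p] = MASK_ID
    return masked, positions


rng = random.Random(0)

# Hypothetical discrete image tokens: 16 tokens drawn from an assumed codebook of 8192 entries.
image_tokens = [rng.randrange(8192) for _ in range(16)]

masked, positions = mask_tokens(image_tokens, mask_ratio=0.5, rng=rng)

# Training targets: the original token ids at the masked positions.
targets = [image_tokens[p] for p in positions]
```

In a full training setup, a model would receive `masked` together with the text embedding and be trained to predict `targets`; that model and the text encoder are left out of this sketch.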

Huiwen Chang, Han Zhang, Jarred Barber, AJ Maschinot, Jose Lezama, Lu Jiang, Ming-Hsuan Yang, Kevin Murphy, William T. Freeman, Michael Rubinstein, Yuanzhen Li, Dilip Krishnan • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Image Generation | MS-COCO 2014 (val) | -- | 128 |
| Text-to-Image Generation | MS-COCO (val) | FID 7.88 | 112 |
| Text-to-Image Generation | COCO 30k subset 2014 (val) | -- | 46 |
| Text-to-Image Generation | MS COCO zero-shot | FID 7.88 | 42 |
| Text-to-Image Generation | MS-COCO 30K (test) | FID 7.88 | 41 |
| Text-to-Image Generation | MS-COCO zero-shot FID-30k 256x256 | -- | 21 |
| Text-to-Image Generation | MS-COCO 256x256 (test) | FID (30K) 7.88 | 14 |
| Text-to-Image Synthesis | COCO 2014 (test) | FID-30k 7.88 | 13 |
| Text-to-Image Synthesis | MS COCO 64x64 zero-shot | Zero-shot FID30k 7.88 | 13 |
| Text-to-Image Generation | CC3M | FID 6.06 | 7 |
