MaskGIT: Masked Generative Image Transformer

About

Generative transformers have experienced rapid popularity growth in the computer vision community in synthesizing high-fidelity and high-resolution images. The best generative transformer models so far, however, still treat an image naively as a sequence of tokens, and decode an image sequentially following the raster scan ordering (i.e. line-by-line). We find this strategy neither optimal nor efficient. This paper proposes a novel image synthesis paradigm using a bidirectional transformer decoder, which we term MaskGIT. During training, MaskGIT learns to predict randomly masked tokens by attending to tokens in all directions. At inference time, the model begins with generating all tokens of an image simultaneously, and then refines the image iteratively conditioned on the previous generation. Our experiments demonstrate that MaskGIT significantly outperforms the state-of-the-art transformer model on the ImageNet dataset, and accelerates autoregressive decoding by up to 64x. Besides, we illustrate that MaskGIT can be easily extended to various image editing tasks, such as inpainting, extrapolation, and image manipulation.

Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, William T. Freeman• 2022

Related benchmarks

Task	Dataset	Result
Mathematical Reasoning	GSM8K	Accuracy78.8	1398
Code Generation	HumanEval	--	1043
Class-conditional Image Generation	ImageNet 256x256	Inception Score (IS)355.6	967
Mathematical Reasoning	GSM8K (test)	Accuracy52.2	954
Mathematical Reasoning	MATH	Accuracy18	535
Image Generation	ImageNet 256x256	IS355.6	517
Text-to-Image Generation	GenEval	Overall Score52	517
Mathematical Reasoning	GSM8K	Accuracy35.03	499
Class-conditional Image Generation	ImageNet 256x256 (val)	Inception Score (IS)355.6	493
Image Generation	ImageNet 256x256 (val)	FID6.18	399

Showing 10 of 129 rows

...

Other info

Follow for update

@wizwand_team Discord