
Switti: Designing Scale-Wise Transformers for Text-to-Image Synthesis

About

This work presents Switti, a scale-wise transformer for text-to-image generation. We start by adapting an existing next-scale prediction autoregressive (AR) architecture to T2I generation, investigating and mitigating training stability issues in the process. Next, we argue that scale-wise transformers do not require causality and propose a non-causal counterpart facilitating ~21% faster sampling and lower memory usage while also achieving slightly better generation quality. Furthermore, we reveal that classifier-free guidance at high-resolution scales is often unnecessary and can even degrade performance. By disabling guidance at these scales, we achieve an additional sampling acceleration of ~32% and improve the generation of fine-grained details. Extensive human preference studies and automated evaluations show that Switti outperforms existing T2I AR models and competes with state-of-the-art T2I diffusion models while being up to 7x faster.
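The abstract's guidance-disabling trick can be illustrated with a toy sampling loop: classifier-free guidance (CFG) runs two forward passes per scale at low resolutions and a single conditional pass at high resolutions. This is a minimal sketch under stated assumptions — the names `sample_scale_wise`, `toy_model`, and `cfg_cutoff` are hypothetical and not Switti's actual API; greedy decoding stands in for the real sampler.

```python
import numpy as np

def sample_scale_wise(model, scales, cond, uncond, cfg_weight=6.0, cfg_cutoff=32):
    """Predict a token map scale by scale; apply CFG only up to `cfg_cutoff`."""
    tokens = []  # token maps predicted so far, one per scale
    for res in scales:  # e.g. [1, 2, 4, 8, 16, 32, 64]
        logits_c = model(tokens, cond, res)  # conditional forward pass
        if res <= cfg_cutoff:
            # CFG: extrapolate away from the unconditional prediction.
            logits_u = model(tokens, uncond, res)
            logits = logits_u + cfg_weight * (logits_c - logits_u)
        else:
            # High-resolution scales: skip guidance entirely, saving one
            # forward pass per scale (the ~32% speedup the abstract cites).
            logits = logits_c
        tokens.append(int(np.argmax(logits, axis=-1)))  # greedy decode
    return tokens

# Toy stand-in "model": returns random logits over a small vocabulary.
def toy_model(tokens, text, res):
    rng = np.random.default_rng(res + len(text))
    return rng.standard_normal(8)
```

Because guidance is skipped above the cutoff, a 7-scale sampler pays for two passes on six scales and one pass on the last, rather than fourteen passes total.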

Anton Voronov, Denis Kuznedelev, Mikhail Khoroshikh, Valentin Khrulkov, Dmitry Baranchuk• 2024

Related benchmarks

Task                        Dataset            Result                        Rank
Text-to-Image Generation    GenEval (test)     -                             221
Text to Image               MJHQ 30K (test)    PS (Perceptual Score) 21.6    18
Text to Image               COCO 30K (test)    PS 22.6                       18
