PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation

About

In this paper, we introduce PixArt-Σ, a Diffusion Transformer (DiT) model capable of directly generating images at 4K resolution. PixArt-Σ represents a significant advancement over its predecessor, PixArt-α, offering images of markedly higher fidelity and improved alignment with text prompts. A key feature of PixArt-Σ is its training efficiency. Leveraging the foundational pre-training of PixArt-α, it evolves from the 'weaker' baseline to a 'stronger' model by incorporating higher-quality data, a process we term "weak-to-strong training". The advancements in PixArt-Σ are twofold: (1) High-Quality Training Data: PixArt-Σ incorporates superior-quality image data, paired with more precise and detailed image captions. (2) Efficient Token Compression: we propose a novel attention module within the DiT framework that compresses both keys and values, significantly improving efficiency and facilitating ultra-high-resolution image generation. Thanks to these improvements, PixArt-Σ achieves superior image quality and user prompt adherence with a significantly smaller model size (0.6B parameters) than existing text-to-image diffusion models, such as SDXL (2.6B parameters) and SD Cascade (5.1B parameters). Moreover, PixArt-Σ's capability to generate 4K images supports the creation of high-resolution posters and wallpapers, efficiently bolstering the production of high-quality visual content in industries such as film and gaming.
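
The key/value compression idea in point (2) can be illustrated with a minimal sketch. The abstract does not say which compression operator is used, so this example assumes plain average pooling over the spatial token grid as a stand-in; the function names (`avg_pool_tokens`, `kv_compressed_attention`) and the `ratio` parameter are hypothetical and are not taken from the paper's code. Queries keep full resolution, while keys and values are pooled, so the number of query-key score pairs shrinks by roughly `ratio ** 2`.

```python
import math


def avg_pool_tokens(grid, ratio):
    """Average-pool an H x W grid of feature vectors by `ratio` per spatial axis.

    Assumption: average pooling stands in for whatever operator the paper uses.
    """
    h, w, dim = len(grid), len(grid[0]), len(grid[0][0])
    pooled = []
    for i in range(0, h, ratio):
        row = []
        for j in range(0, w, ratio):
            block = [grid[a][b]
                     for a in range(i, min(i + ratio, h))
                     for b in range(j, min(j + ratio, w))]
            row.append([sum(v[d] for v in block) / len(block) for d in range(dim)])
        pooled.append(row)
    return pooled


def flatten(grid):
    """Turn an H x W grid of vectors into a flat token sequence."""
    return [vec for row in grid for vec in row]


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Single-head scaled dot-product attention on lists of vectors."""
    scale = 1.0 / math.sqrt(len(queries[0]))
    dim = len(values[0])
    out = []
    for q in queries:
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)])
    return out


def kv_compressed_attention(q_grid, k_grid, v_grid, ratio=2):
    """Attend full-resolution queries to spatially pooled keys and values.

    Returns the attention output and the number of query-key score pairs.
    """
    queries = flatten(q_grid)
    keys = flatten(avg_pool_tokens(k_grid, ratio))
    values = flatten(avg_pool_tokens(v_grid, ratio))
    return attention(queries, keys, values), len(queries) * len(keys)


# Toy 4 x 4 grid of 2-dimensional tokens (illustrative data only).
grid = [[[float(i), float(j)] for j in range(4)] for i in range(4)]
_, full_pairs = kv_compressed_attention(grid, grid, grid, ratio=1)
out, compressed_pairs = kv_compressed_attention(grid, grid, grid, ratio=2)
print(full_pairs, compressed_pairs)
```

On this toy grid the output still has one vector per query token; only the key/value sequence is shortened, which is where the saving at high resolutions would come from under these assumptions.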

Junsong Chen, Chongjian Ge, Enze Xie, Yue Wu, Lewei Yao, Xiaozhe Ren, Zhongdao Wang, Ping Luo, Huchuan Lu, Zhenguo Li • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Text-to-Image Generation | GenEval | Overall Score: 48 | 467 |
| Text-to-Image Generation | GenEval | GenEval Score: 54 | 277 |
| Text-to-Image Generation | DPG-Bench | Overall Score: 80.54 | 173 |
| Text-to-Image Generation | GenEval (test) | -- | 169 |
| Text-to-Image Generation | DPG | Overall Score: 80.54 | 131 |
| Text-to-Image Generation | MS-COCO 2014 (val) | -- | 128 |
| Text-to-Image Generation | MS-COCO (val) | FID: 13.35 | 112 |
| Text-to-Image Generation | DPG-Bench | DPG Score: 80.5 | 89 |
| Text-to-Image Generation | GenEval | Two Objects: 62 | 87 |
| Text-to-Image Generation | T2I-CompBench (test) | -- | 67 |
Showing 10 of 29 rows
