Scaling Rectified Flow Transformers for High-Resolution Image Synthesis

About

Diffusion models create data from noise by inverting the forward paths of data towards noise and have emerged as a powerful generative modeling technique for high-dimensional, perceptual data such as images and videos. Rectified flow is a recent generative model formulation that connects data and noise in a straight line. Despite its better theoretical properties and conceptual simplicity, it is not yet decisively established as standard practice. In this work, we improve existing noise sampling techniques for training rectified flow models by biasing them towards perceptually relevant scales. Through a large-scale study, we demonstrate the superior performance of this approach compared to established diffusion formulations for high-resolution text-to-image synthesis. Additionally, we present a novel transformer-based architecture for text-to-image generation that uses separate weights for the two modalities and enables a bidirectional flow of information between image and text tokens, improving text comprehension, typography, and human preference ratings. We demonstrate that this architecture follows predictable scaling trends and correlates lower validation loss to improved text-to-image synthesis as measured by various metrics and human evaluations. Our largest models outperform state-of-the-art models, and we will make our experimental data, code, and model weights publicly available.
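The straight-line connection between data and noise described above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the convention that t = 0 is data and t = 1 is noise, the function names, and the logit-normal timestep sampler are all choices made here for the example. The abstract says only that noise sampling is biased towards perceptually relevant scales and does not name a specific scheme.

```python
import math
import random


def interpolate(x0, noise, t):
    """Point on the straight line between data x0 (t=0) and noise (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, noise)]


def velocity_target(x0, noise):
    """Time derivative of the straight-line path; constant in t."""
    return [b - a for a, b in zip(x0, noise)]


def sample_t_logit_normal(rng, mean=0.0, std=1.0):
    """Hypothetical sampler: squash a Gaussian draw through a sigmoid.

    This concentrates t around intermediate values instead of sampling
    uniformly. It is an assumed example of a non-uniform scheme, not
    something the abstract specifies.
    """
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))


rng = random.Random(0)
x0 = [0.5, -1.0, 2.0]                       # toy "data" vector
noise = [rng.gauss(0.0, 1.0) for _ in x0]   # Gaussian noise sample
t = sample_t_logit_normal(rng)
z_t = interpolate(x0, noise, t)             # noisy training input
v = velocity_target(x0, noise)              # regression target for a model
```

Because the path is a straight line, the target `v` does not depend on `t`, which is what makes the formulation conceptually simple: stepping from `z_t` along `v` for the remaining time `1 - t` lands exactly on the noise sample.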

Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, Robin Rombach • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
| Text-to-Image Generation | GenEval | Overall Score: 74 | 506 |
| Text-to-Image Generation | GenEval | Overall Score: 74 | 391 |
| Text-to-Image Generation | GenEval | GenEval Score: 74 | 360 |
| Text-to-Image Generation | DPG-Bench | Overall Score: 84.1 | 265 |
| Image Generation | ImageNet (val) | Inception Score: 250 | 247 |
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc: 94 | 221 |
| Text-to-Image Generation | GenEval | Overall Score: 74 | 218 |
| Text-to-Image Generation | MS-COCO (val) | FID: 5.08 | 202 |
| Text-to-Image Generation | T2I-CompBench | Shape Fidelity: 65.56 | 185 |
| Text-to-Image Generation | DPG | Overall Score: 85.43 | 172 |
Showing 10 of 251 rows