PixelFlow: Pixel-Space Generative Models with Flow
About
We present PixelFlow, a family of image generation models that operate directly in the raw pixel space, in contrast to the predominant latent-space models. This approach simplifies the image generation process by eliminating the need for a pre-trained Variational Autoencoder (VAE) and making the whole model end-to-end trainable. Through efficient cascade flow modeling, PixelFlow keeps the computation cost in pixel space affordable. It achieves an FID of 1.98 on the 256$\times$256 ImageNet class-conditional image generation benchmark. The qualitative text-to-image results demonstrate that PixelFlow excels in image quality, artistry, and semantic control. We hope this new paradigm will inspire and open up new opportunities for next-generation visual generation models. Code and models are available at https://github.com/ShoufaChen/PixelFlow.
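For readers unfamiliar with the metric quoted above, FID (Fréchet Inception Distance) is commonly defined as the Fréchet distance between two Gaussians fitted to Inception-v3 features of real and generated images. The symbols below are notation introduced here for illustration ($\mu_r, \Sigma_r$ for the real-image feature mean and covariance, $\mu_g, \Sigma_g$ for the generated-image ones); the abstract does not state which reference statistics or implementation were used.

$$
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^2 + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\left( \Sigma_r \Sigma_g \right)^{1/2} \right)
$$

Lower values indicate that the generated feature distribution is closer to the real one.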
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 282.1 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 1.98 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 282.1 | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 1.98 | 293 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID | 1.98 | 178 |
| Text-to-Image Generation | GenEval (test) | -- | -- | 169 |
| Class-conditional Image Generation | ImageNet-1K 256x256 1.0 (train) | gFID | 1.98 | 35 |
| Image Generation | ImageNet 256x256 (train val) | FID | 1.98 | 34 |
| Class-to-image generation | ImageNet 256x256 | FID | 12.23 | 15 |