PixelDiT: Pixel Diffusion Transformers for Image Generation

About

Latent-space modeling has been the standard for Diffusion Transformers (DiTs). However, it relies on a two-stage pipeline where the pretrained autoencoder introduces lossy reconstruction, leading to error accumulation while hindering joint optimization. To address these issues, we propose PixelDiT, a single-stage, end-to-end model that eliminates the need for the autoencoder and learns the diffusion process directly in the pixel space. PixelDiT adopts a fully transformer-based architecture shaped by a dual-level design: a patch-level DiT that captures global semantics and a pixel-level DiT that refines texture details, enabling efficient training of a pixel-space diffusion model while preserving fine details. PixelDiT achieves 1.61 FID on ImageNet 256 and 1.81 FID on ImageNet 512, surpassing existing pixel generative models. We further extend PixelDiT to text-to-image generation and pretrain it at the 1024² resolution in pixel space. It achieves 0.74 on GenEval and 83.5 on DPG-bench, approaching the best latent diffusion models. Code: https://github.com/NVlabs/PixelDiT
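
To make the dual-level idea above more concrete, here is a minimal NumPy sketch of how a single image could be viewed at two granularities: one token per patch (the kind of input a patch-level transformer might see) and one token per pixel within each patch (the kind of input a pixel-level transformer might refine). This is an illustration only, not the authors' implementation; the function names `patchify` and `unpatchify` and the patch size of 16 are assumptions chosen for the example and are not stated in the abstract.

```python
import numpy as np


def patchify(image, patch_size):
    """Split an (H, W, C) image into (num_patches, patch_size**2, C) pixel tokens.

    Hypothetical helper for illustration; patch_size is an assumed value.
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0, "image must tile evenly"
    gh, gw = h // patch_size, w // patch_size
    # Expose the patch grid, then group the pixels of each patch together.
    x = image.reshape(gh, patch_size, gw, patch_size, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, p, p, c)
    return x.reshape(gh * gw, patch_size * patch_size, c)


def unpatchify(tokens, height, width):
    """Inverse of patchify: rebuild the (H, W, C) image from pixel tokens."""
    n, pp, c = tokens.shape
    p = int(round(pp ** 0.5))
    gh, gw = height // p, width // p
    x = tokens.reshape(gh, gw, p, p, c).transpose(0, 2, 1, 3, 4)
    return x.reshape(height, width, c)


# A random stand-in for a 256x256 RGB image.
image = np.random.default_rng(0).random((256, 256, 3))

# Pixel-level view: every patch keeps its individual pixels as tokens.
pixel_tokens = patchify(image, patch_size=16)

# Patch-level view: each patch is flattened into a single token vector.
patch_tokens = pixel_tokens.reshape(pixel_tokens.shape[0], -1)

# The split is lossless, unlike an autoencoder's latent encoding.
restored = unpatchify(pixel_tokens, 256, 256)
```

With the assumed patch size of 16, a 256x256 image yields 256 patch tokens, each covering 256 pixels. Because the rearrangement is an exact permutation of the pixel values, no reconstruction error is introduced at this step, which is the contrast the abstract draws with a lossy pretrained autoencoder.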

Yongsheng Yu, Wei Xiong, Weili Nie, Yichen Sheng, Shiqiu Liu, Jiebo Luo • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 | -- | 967 |
| Image Generation | ImageNet 256x256 | IS 292.7 | 517 |
| Text-to-Image Generation | GenEval | GenEval Score 74 | 442 |
| Image Generation | ImageNet 256x256 (val) | FID 1.61 | 399 |
| Text-to-Image Generation | DPG-Bench | DPG Score 83.5 | 156 |
| Text-to-Image Generation | GenEval | Overall Score (GenEval) 0.74 | 153 |
| Image Generation | ImageNet 512x512 | IS 271.1 | 83 |
| Text-to-Image Generation | HPS v3 | Overall Score 8.95 | 48 |
| Class-to-image generation | ImageNet 256x256 | FID 1.61 | 38 |
| Class-conditional Image Generation | ImageNet-1K 256x256 1.0 (train) | gFID 1.61 | 35 |
