
PixNerd: Pixel Neural Field Diffusion

About

The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, prior work returns to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet 256x256 and 2.84 FID on ImageNet 512x512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
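The core idea of replacing a VAE decoder with a patch-wise neural field can be illustrated with a minimal sketch: a small MLP maps normalized pixel coordinates, conditioned on a transformer patch token, to RGB values, so each patch is decoded directly in pixel space. This is a simplified NumPy illustration, not the paper's implementation; the function names, layer sizes, and the concatenation-based conditioning are assumptions (in PixNerd-style designs the MLP weights may themselves be predicted per patch).

```python
import numpy as np

def patch_coords(p):
    # Normalized pixel-center coordinates in [-1, 1] for a p x p patch.
    xs = (np.arange(p) + 0.5) / p * 2.0 - 1.0
    gx, gy = np.meshgrid(xs, xs, indexing="xy")
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)  # (p*p, 2)

def neural_field_decode(feat, weights, p=16):
    """Decode one patch: a coordinate MLP, conditioned on the patch
    feature `feat` (shape (d,)), predicts RGB at every pixel.
    Hypothetical sketch, not the paper's actual architecture."""
    w1, b1, w2, b2 = weights
    coords = patch_coords(p)                              # (p*p, 2)
    cond = np.broadcast_to(feat, (p * p, feat.shape[-1]))
    h = np.concatenate([coords, cond], axis=-1)           # (p*p, 2 + d)
    h = np.tanh(h @ w1 + b1)                              # hidden layer
    rgb = h @ w2 + b2                                     # (p*p, 3)
    return rgb.reshape(p, p, 3)

# Usage with random (untrained) weights, feature dim d=8, hidden width 32:
rng = np.random.default_rng(0)
d, hidden, p = 8, 32, 16
weights = (
    rng.standard_normal((2 + d, hidden)) * 0.1, np.zeros(hidden),
    rng.standard_normal((hidden, 3)) * 0.1, np.zeros(3),
)
patch = neural_field_decode(rng.standard_normal(d), weights, p=p)
```

Because the decoder is just a coordinate-to-RGB function, it is queried per pixel and needs no separate decoding stage, which is what makes the single-stage, end-to-end pipeline possible.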

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 297 | 815 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 1.95 | 427 |
| Text-to-Image Generation | GenEval | Overall Score | 73 | 391 |
| Image Generation | ImageNet 256x256 | IS | 297 | 359 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 297 | 345 |
| Image Generation | ImageNet 256x256 (val) | FID | 1.95 | 340 |
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc | 86 | 221 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID | 2.15 | 178 |
| Image Generation | ImageNet 256x256 (train) | FID | 1.93 | 164 |
| Class-conditional Image Generation | ImageNet 512x512 (val) | FID (Val) | 2.84 | 97 |

(Showing 10 of 16 rows.)
