PixNerd: Pixel Neural Field Diffusion

About

The current success of diffusion transformers depends heavily on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, researchers have returned to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast to these efforts, we propose to model the patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieved 2.15 FID on ImageNet $256\times256$ and 2.84 FID on ImageNet $512\times512$ without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieved a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
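As a rough illustration of what "patch-wise decoding with a neural field" can mean, the sketch below turns one feature vector per patch into the weights of a tiny coordinate MLP and evaluates that MLP at every pixel position inside the patch. The abstract does not specify the network, so the two-layer MLP, its sizes, the `proj` matrix, and all function names here are assumptions made for illustration, not the authors' implementation. The patch size of 16 is taken from the "/16" in the model name, which is also an assumption about its meaning.

```python
import numpy as np

def patch_coords(p):
    """Normalized (y, x) coordinates in [-1, 1] for a p x p patch, shape (p*p, 2)."""
    axis = np.linspace(-1.0, 1.0, p)
    yy, xx = np.meshgrid(axis, axis, indexing="ij")
    return np.stack([yy.ravel(), xx.ravel()], axis=-1)

def decode_patches(tokens, proj, p=16, hidden=8, channels=3):
    """Decode each patch token into a p x p x channels pixel patch.

    tokens: (n, d) array, one feature vector per patch.
    proj:   (d, 2*hidden + hidden*channels) hypothetical linear map from a
            token to the weights of a tiny two-layer coordinate MLP.
    """
    n, _ = tokens.shape
    params = tokens @ proj                               # (n, n_params)
    w1 = params[:, : 2 * hidden].reshape(n, 2, hidden)   # coords -> hidden
    w2 = params[:, 2 * hidden:].reshape(n, hidden, channels)
    coords = patch_coords(p)                             # (p*p, 2)
    h = np.tanh(np.einsum("qc,nch->nqh", coords, w1))    # (n, p*p, hidden)
    out = np.einsum("nqh,nhk->nqk", h, w2)               # (n, p*p, channels)
    return out.reshape(n, p, p, channels)

def unpatchify(patches, grid):
    """Arrange (grid*grid, p, p, c) patches into a (grid*p, grid*p, c) image."""
    n, p, _, c = patches.shape
    tiles = patches.reshape(grid, grid, p, p, c)
    return tiles.transpose(0, 2, 1, 3, 4).reshape(grid * p, grid * p, c)

# A 256x256 image with 16x16 patches uses (256 // 16) ** 2 = 256 tokens.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(256, 32))                # stand-in for transformer outputs
proj = rng.normal(size=(32, 2 * 8 + 8 * 3)) * 0.1  # hypothetical weight generator
image = unpatchify(decode_patches(tokens, proj), grid=16)
print(image.shape)
```

The point of the sketch is only the data flow: the token count stays at one per patch, while the per-pixel detail comes from evaluating a small coordinate function rather than from a separate VAE decoder.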

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 297 | 441 |
| Image Generation | ImageNet 256x256 (val) | FID | 2.15 | 307 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 297 | 305 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 1.95 | 293 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID | 2.15 | 178 |
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc | 86 | 169 |
| Class-conditional Image Generation | ImageNet-1K 256x256 1.0 (train) | gFID | 1.95 | 35 |
| Image Generation | ImageNet 256x256 (train val) | FID | 2.15 | 34 |
| Image Generation | ImageNet 512x512 | FID | 2.84 | 34 |
| Class-to-image generation | ImageNet 256x256 | FID | 15.61 | 15 |