
PixNerd: Pixel Neural Field Diffusion

About

The current success of diffusion transformers heavily depends on the compressed latent space shaped by a pre-trained variational autoencoder (VAE). However, this two-stage training paradigm inevitably introduces accumulated errors and decoding artifacts. To address these problems, prior work returns to pixel space at the cost of complicated cascade pipelines and increased token complexity. In contrast, we propose to model patch-wise decoding with a neural field and present a single-scale, single-stage, efficient, end-to-end solution, coined pixel neural field diffusion (PixNerd). Thanks to the efficient neural field representation in PixNerd, we directly achieve 2.15 FID on ImageNet 256x256 and 2.84 FID on ImageNet 512x512 without any complex cascade pipeline or VAE. We also extend our PixNerd framework to text-to-image applications. Our PixNerd-XXL/16 achieves a competitive 0.73 overall score on the GenEval benchmark and an 80.9 overall score on the DPG benchmark.
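The core idea of replacing a VAE decoder with a patch-wise neural field can be illustrated with a minimal sketch: a small MLP maps normalized pixel coordinates, conditioned on a transformer patch token, to RGB values, so each patch is decoded directly in pixel space. This is a simplified NumPy illustration, not the paper's implementation; the function names, layer sizes, and the concatenation-based conditioning are assumptions (in PixNerd-style designs the MLP weights may themselves be predicted per patch).

```python
import numpy as np

def patch_coords(p):
    # Normalized pixel-center coordinates in [-1, 1] for a p x p patch.
    xs = (np.arange(p) + 0.5) / p * 2.0 - 1.0
    gx, gy = np.meshgrid(xs, xs, indexing="xy")
    return np.stack([gx.ravel(), gy.ravel()], axis=-1)  # (p*p, 2)

def neural_field_decode(feat, weights, p=16):
    """Decode one patch: a coordinate MLP, conditioned on the patch
    feature `feat` (shape (d,)), predicts RGB at every pixel.
    Hypothetical sketch, not the paper's actual architecture."""
    w1, b1, w2, b2 = weights
    coords = patch_coords(p)                              # (p*p, 2)
    cond = np.broadcast_to(feat, (p * p, feat.shape[-1]))
    h = np.concatenate([coords, cond], axis=-1)           # (p*p, 2 + d)
    h = np.tanh(h @ w1 + b1)                              # hidden layer
    rgb = h @ w2 + b2                                     # (p*p, 3)
    return rgb.reshape(p, p, 3)

# Usage with random (untrained) weights, feature dim d=8, hidden width 32:
rng = np.random.default_rng(0)
d, hidden, p = 8, 32, 16
weights = (
    rng.standard_normal((2 + d, hidden)) * 0.1, np.zeros(hidden),
    rng.standard_normal((hidden, 3)) * 0.1, np.zeros(3),
)
patch = neural_field_decode(rng.standard_normal(d), weights, p=p)
```

Because the decoder is just a coordinate-to-RGB function, it is queried per pixel and needs no separate decoding stage, which is what makes the single-stage, end-to-end pipeline possible.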

Shuai Wang, Ziteng Gao, Chenhui Zhu, Weilin Huang, Limin Wang• 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Class-conditional Image Generation | ImageNet 256x256 | Inception Score (IS) | 297 | 815 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | FID | 1.95 | 427 |
| Text-to-Image Generation | GenEval | Overall Score | 73 | 391 |
| Image Generation | ImageNet 256x256 | IS | 297 | 359 |
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS | 297 | 345 |
| Image Generation | ImageNet 256x256 (val) | FID | 1.95 | 340 |
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc | 86 | 221 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | FID | 2.15 | 178 |
| Image Generation | ImageNet 256x256 (train) | FID | 1.93 | 164 |
| Class-conditional Image Generation | ImageNet 512x512 (val) | FID (Val) | 2.84 | 97 |

(Showing 10 of 16 rows.)
