
PixelGen: Improving Pixel Diffusion with Perceptual Supervision

About

Pixel diffusion generates images directly in pixel space, avoiding the VAE artifacts and representational bottlenecks of two-stage latent diffusion. The recent JiT further simplifies pixel diffusion with x-prediction, where the model predicts clean images rather than velocity. However, the standard pixel-wise diffusion loss treats all pixels equally, spending model capacity on perceptually insignificant signals and often producing blurry samples. We propose PixelGen, an end-to-end pixel diffusion framework that augments x-prediction with perceptual supervision. Specifically, PixelGen adds two complementary perceptual losses on top of x-prediction: an LPIPS loss for local textures and a P-DINO loss for global semantics. To preserve sample coverage, PixelGen further proposes a noise-gating strategy that applies these losses only at lower-noise timesteps. On ImageNet-256 without classifier-free guidance, PixelGen achieves an FID of 5.11 in 80 training epochs, surpassing the latent diffusion baselines. Moreover, PixelGen scales efficiently to text-to-image generation, reaching a GenEval score of 0.79 with only 6 days of training on 8xH800 GPUs. These results show that perceptual supervision substantially narrows the gap between pixel and latent diffusion while preserving a simple one-stage pipeline. Code is available at https://github.com/Zehong-Ma/PixelGen.
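The noise-gating idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the function names, the gate threshold of 0.5, the loss weights, the convention that a noise level of 1.0 means pure noise, and the two toy distance functions (stand-ins for LPIPS and P-DINO, which in practice require pretrained networks) are all assumptions made for illustration only.

```python
from typing import Callable, Sequence, Tuple

Image = Sequence[float]  # flattened pixel values; a stand-in for a real image tensor
Distance = Callable[[Image, Image], float]


def mse(a: Image, b: Image) -> float:
    """Pixel-wise mean squared error between a predicted and a clean image."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def toy_texture_distance(a: Image, b: Image) -> float:
    """Hypothetical stand-in for a local-texture loss (NOT LPIPS).

    Compares differences between neighbouring pixels instead of raw values.
    """
    grad_a = [a[i + 1] - a[i] for i in range(len(a) - 1)]
    grad_b = [b[i + 1] - b[i] for i in range(len(b) - 1)]
    return mse(grad_a, grad_b)


def toy_global_distance(a: Image, b: Image) -> float:
    """Hypothetical stand-in for a global-semantics loss (NOT P-DINO).

    Compares a single global statistic (the mean) of each image.
    """
    return (sum(a) / len(a) - sum(b) / len(b)) ** 2


def noise_gated_loss(
    x_pred: Image,
    x_clean: Image,
    noise_level: float,
    perceptual_terms: Sequence[Tuple[float, Distance]],
    gate: float = 0.5,  # assumed threshold; the abstract does not give a value
) -> float:
    """x-prediction pixel loss plus perceptual terms applied only at low noise.

    noise_level is assumed to lie in [0, 1], with 1.0 meaning pure noise.
    """
    loss = mse(x_pred, x_clean)
    if noise_level < gate:
        # Perceptual supervision is switched on only for lower-noise timesteps.
        for weight, distance in perceptual_terms:
            loss += weight * distance(x_pred, x_clean)
    return loss


# Usage: a flat gray prediction of a striped target.
x_clean = [0.0, 1.0, 0.0, 1.0]
x_pred = [0.5, 0.5, 0.5, 0.5]
terms = [(0.1, toy_texture_distance), (0.1, toy_global_distance)]

low_noise_loss = noise_gated_loss(x_pred, x_clean, noise_level=0.2, perceptual_terms=terms)
high_noise_loss = noise_gated_loss(x_pred, x_clean, noise_level=0.9, perceptual_terms=terms)
```

In this toy case the blurry prediction matches the target's mean, so only the texture term adds a penalty, and only at the low noise level. At the high noise level the loss reduces to the plain pixel-wise term.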

Zehong Ma, Ruihan Xu, Shiliang Zhang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 | -- | 967 |
| Image Generation | ImageNet 256x256 | IS: 293.6 | 517 |
| Class-conditional Image Generation | ImageNet 256x256 (val) | Inception Score (IS): 293.6 | 493 |
| Text-to-Image Generation | DPG-Bench | -- | 451 |
| Text-to-Image Generation | GenEval (test) | Two Obj. Acc: 88 | 250 |
| Class-conditional Image Generation | ImageNet 256x256 (train val) | -- | 203 |
| Text-to-Image Generation | GenEval | Overall Score (GenEval): 0.79 | 153 |
| Class-to-image generation | ImageNet 256x256 | FID: 5.11 | 38 |
| Image Generation | ImageNet 256x256 (no CFG) | gFID: 5.11 | 11 |
| Text-to-Image Generation | PixelGen 512x512 (test) | ImageReward: 92.1 | 3 |
