Learning Stackable and Skippable LEGO Bricks for Efficient, Reconfigurable, and Variable-Resolution Diffusion Modeling

About

Diffusion models excel at generating photo-realistic images but come with significant computational costs in both training and sampling. While various techniques address these computational challenges, a less-explored issue is designing an efficient and adaptable network backbone for iterative refinement. Current options like U-Net and Vision Transformer often rely on resource-intensive deep networks and lack the flexibility needed for generating images at variable resolutions or with a smaller network than used in training. This study introduces LEGO bricks, which seamlessly integrate Local-feature Enrichment and Global-content Orchestration. These bricks can be stacked to create a test-time reconfigurable diffusion backbone, allowing selective skipping of bricks to reduce sampling costs and generate higher-resolution images than the training data. LEGO bricks enrich local regions with an MLP and transform them using a Transformer block while maintaining a consistent full-resolution image across all bricks. Experimental results demonstrate that LEGO bricks enhance training efficiency, expedite convergence, and facilitate variable-resolution image generation while maintaining strong generative performance. Moreover, LEGO significantly reduces sampling time compared to other methods, establishing it as a valuable enhancement for diffusion models. Our code and project page are available at https://jegzheng.github.io/LEGODiffusion.

Huangjie Zheng, Zhendong Wang, Jianbo Yuan, Guanghan Ning, Pengcheng He, Quanzeng You, Hongxia Yang, Mingyuan Zhou • 2023
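
The stacking and skipping idea described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: `ToyBrick`, `run_stack`, `split_patches`, and `merge_patches` are hypothetical names, the per-patch MLP is replaced by a fixed pointwise nonlinearity, and the Transformer block is replaced by a simple softmax attention over patch means. The only property it aims to show is that each brick maps a full-resolution image to a full-resolution image, so bricks can be skipped at test time and the same stack can be applied to a larger image.

```python
import math

def split_patches(img, p):
    """Split an H x W image (list of rows) into non-overlapping p x p patches."""
    h, w = len(img), len(img[0])
    assert h % p == 0 and w % p == 0, "image size must be divisible by patch size"
    return [[img[r + i][c + j] for i in range(p) for j in range(p)]
            for r in range(0, h, p) for c in range(0, w, p)]

def merge_patches(patches, p, h, w):
    """Inverse of split_patches: reassemble a full-resolution H x W image."""
    img = [[0.0] * w for _ in range(h)]
    cols = w // p
    for k, patch in enumerate(patches):
        r, c = (k // cols) * p, (k % cols) * p
        for i in range(p):
            for j in range(p):
                img[r + i][c + j] = patch[i * p + j]
    return img

class ToyBrick:
    """Hypothetical stand-in for one brick: local step, global step, full-res output."""

    def __init__(self, patch_size, scale=0.5):
        self.p = patch_size
        self.scale = scale

    def local_enrich(self, patch):
        # Assumption: a fixed tanh replaces the learned per-patch MLP.
        return [math.tanh(self.scale * v) for v in patch]

    def global_orchestrate(self, tokens):
        # Assumption: softmax attention over patch means replaces the Transformer block.
        summaries = [sum(t) / len(t) for t in tokens]
        out = []
        for qi, q in enumerate(summaries):
            scores = [q * k for k in summaries]
            m = max(scores)  # subtract the max for numerical stability
            weights = [math.exp(s - m) for s in scores]
            z = sum(weights)
            context = sum(a * k for a, k in zip(weights, summaries)) / z
            out.append([v + context for v in tokens[qi]])
        return out

    def __call__(self, img):
        h, w = len(img), len(img[0])
        tokens = [self.local_enrich(pt) for pt in split_patches(img, self.p)]
        tokens = self.global_orchestrate(tokens)
        return merge_patches(tokens, self.p, h, w)

def run_stack(bricks, img, skip=()):
    """Apply bricks in order, skipping any whose index is listed in `skip`."""
    for idx, brick in enumerate(bricks):
        if idx in skip:
            continue
        img = brick(img)
    return img

# Three bricks with different patch sizes, applied to an 8 x 8 toy image.
bricks = [ToyBrick(2), ToyBrick(4), ToyBrick(2)]
img8 = [[(r * 8 + c) / 64.0 for c in range(8)] for r in range(8)]
full = run_stack(bricks, img8)            # all bricks
fast = run_stack(bricks, img8, skip={1})  # middle brick skipped at test time

# The same stack applied to a larger 16 x 16 image.
img16 = [[(r * 16 + c) / 256.0 for c in range(16)] for r in range(16)]
big = run_stack(bricks, img16)
```

Because every brick consumes and produces an image of the same size, `run_stack` needs no shape bookkeeping when a brick is dropped, and the only resolution constraint in this sketch is divisibility by each brick's patch size.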

Related benchmarks

Task	Dataset	Result
Class-conditional Image Generation	ImageNet 256x256 (train)	IS 338.1	367
Unconditional Image Generation	CIFAR-10 unconditional	FID 1.88	209
Unconditional Image Generation	CelebA unconditional 64 x 64	FID 2.09	95
Image Generation	ImageNet 512x512 (test)	FID 3.74	74
Class-conditional Image Generation	ImageNet 512x512 (train)	FID 3.74	52
Panorama Generation	ImageNet 256x256	LPIPS 0.14	6
Panorama Generation	ImageNet 512x512	LPIPS 0.36	6
Conditional Image Generation	ImageNet 64 x 64 (train)	FID 2.16	4
