Training-free Mixed-Resolution Latent Upsampling for Spatially Accelerated Diffusion Transformers

About

Diffusion transformers (DiTs) offer excellent scalability for high-fidelity generation, but their computational overhead poses a major challenge for practical deployment. Existing acceleration methods primarily exploit the temporal dimension, whereas spatial acceleration remains underexplored. In this work, we investigate spatial acceleration for DiTs via latent upsampling. We find that naïve latent upsampling introduces artifacts, primarily due to aliasing in high-frequency edge regions and mismatches arising from noise-timestep discrepancies. Based on these findings, we propose a training-free spatial acceleration framework, dubbed Region-Adaptive Latent Upsampling (RALU), which mitigates these artifacts while spatially accelerating DiTs through mixed-resolution latent upsampling. RALU achieves artifact-free, efficient acceleration by applying early upsampling only to artifact-prone edge regions and by matching noise and timesteps across latent resolutions, yielding up to 7.0× speedup on FLUX-1.dev and 3.0× on Stable Diffusion 3 with negligible quality degradation. Furthermore, RALU is complementary to existing temporal acceleration methods and timestep-distilled models; combined with them, it reaches up to 15.9× speedup.
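To make the idea of upsampling only edge regions concrete, here is a minimal sketch of how a region mask might be chosen on a low-resolution latent and what that does to the token count. It is not the authors' implementation: the finite-difference edge score, the function names (`edge_score`, `select_edge_mask`, `mixed_resolution_token_count`), the `ratio` parameter, and the 2× upsampling factor are all assumptions made for illustration, and the noise-timestep matching step is not covered.

```python
import numpy as np

def edge_score(latent):
    """Gradient magnitude averaged over channels for a (C, H, W) latent.

    Uses central differences as a stand-in edge detector (an assumption,
    not the detector used in the paper).
    """
    gy = np.zeros_like(latent)
    gx = np.zeros_like(latent)
    gy[:, 1:-1, :] = latent[:, 2:, :] - latent[:, :-2, :]
    gx[:, :, 1:-1] = latent[:, :, 2:] - latent[:, :, :-2]
    return np.sqrt(gx ** 2 + gy ** 2).mean(axis=0)

def select_edge_mask(latent, ratio=0.25):
    """Boolean (H, W) mask of the positions with the highest edge score.

    Roughly the top `ratio` fraction is selected; ties at the threshold
    are all included, so the mask can be slightly larger.
    """
    score = edge_score(latent)
    k = max(1, int(round(ratio * score.size)))
    threshold = np.partition(score.ravel(), -k)[-k]
    return score >= threshold

def mixed_resolution_token_count(mask, factor=2):
    """Tokens when masked positions are upsampled by `factor` per side
    and all other positions stay at low resolution."""
    n_edge = int(mask.sum())
    n_flat = mask.size - n_edge
    return n_flat + n_edge * factor * factor

# Toy latent: a vertical step edge in a 4-channel 16x16 latent.
latent = np.zeros((4, 16, 16))
latent[:, :, 8:] = 1.0

mask = select_edge_mask(latent, ratio=0.125)
mixed_tokens = mixed_resolution_token_count(mask, factor=2)
full_tokens = mask.size * 4  # every position upsampled 2x per side
print(int(mask.sum()), mixed_tokens, full_tokens)
```

In this toy case only the two columns next to the step are selected, so the mixed-resolution latent carries 352 tokens instead of the 1024 needed if every position were upsampled, which is the kind of saving spatial acceleration relies on.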

Wongi Jeong, Kyungryeol Lee, Hoigi Seo, Se Young Chun • 2025

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	FLUX.1 (dev)	Image Reward 0.9644	56
Text-to-Image Generation	FLUX.1 1024x1024 resolution (dev)	ImageReward 1.028	20
Image Generation	Flux (test)	ImgReward 0.94	14
Text-to-Image Generation	FLUX.1 dev native (test)	Speedup 8.21	13
Text-to-Image Generation	FLUX.1 dev (test)	ImageReward 1.028	13
Text-to-Image Generation	Qwen-Image native (test)	Speedup 9.51	11
