
TinyFusion: Diffusion Transformers Learned Shallow

About

Diffusion Transformers have demonstrated remarkable capabilities in image generation but often come with excessive parameterization, resulting in considerable inference overhead in real-world applications. In this work, we present TinyFusion, a depth pruning method designed to remove redundant layers from diffusion transformers via end-to-end learning. The core principle of our approach is to create a pruned model with high recoverability, allowing it to regain strong performance after fine-tuning. To accomplish this, we introduce a differentiable sampling technique to make pruning learnable, paired with a co-optimized parameter to simulate future fine-tuning. While prior works focus on minimizing loss or error after pruning, our method explicitly models and optimizes the post-fine-tuning performance of pruned models. Experimental results indicate that this learnable paradigm offers substantial benefits for layer pruning of diffusion transformers, surpassing existing importance-based and error-based methods. Additionally, TinyFusion exhibits strong generalization across diverse architectures, such as DiTs, MARs, and SiTs. Experiments with DiT-XL show that TinyFusion can craft a shallow diffusion transformer at less than 7% of the pre-training cost, achieving a 2$\times$ speedup with an FID score of 2.86, outperforming competitors with comparable efficiency. Code is available at https://github.com/VainF/TinyFusion.
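The abstract's core idea is to make the choice of which layers to keep differentiable, so the pruning decision can be trained end-to-end. As an illustration only, the sketch below uses Gumbel perturbation with a temperature to produce a relaxed keep-mask over layers; the function name, shapes, and the top-k selection are assumptions for this example, not the paper's actual implementation.

```python
import numpy as np

def gumbel_topk_mask(logits, k, tau=1.0, seed=None):
    """Sample a relaxed binary keep-mask over transformer layers.

    Illustrative sketch of 'differentiable sampling' for depth pruning:
    layer-retention logits are perturbed with Gumbel noise and softened
    with temperature tau, so which k layers to keep could receive
    gradients in an autograd framework. Purely hypothetical naming.
    """
    rng = np.random.default_rng(seed)
    # Gumbel(0, 1) noise: -log(-log(U)) with U ~ Uniform(0, 1).
    gumbel = -np.log(-np.log(rng.uniform(1e-9, 1.0, size=logits.shape)))
    scores = (logits + gumbel) / tau
    # Soft mask: normalised scores rescaled to sum to k.
    soft = np.exp(scores - scores.max())
    soft = k * soft / soft.sum()
    # Hard mask: keep the k highest-scoring layers. In an autograd
    # framework one would use a straight-through estimator here:
    # hard + (soft - soft.detach()).
    hard = np.zeros_like(soft)
    hard[np.argsort(scores)[-k:]] = 1.0
    return hard, soft

# Example: prune a 28-layer DiT-XL-like stack to 14 layers (a 2x depth cut).
logits = np.zeros(28)  # start from uniform retention preferences
hard, soft = gumbel_topk_mask(logits, k=14, seed=0)
print(int(hard.sum()))  # 14 layers kept
```

In a full training loop, the retention logits would be co-optimized with recovery parameters (the abstract mentions a co-optimized parameter simulating future fine-tuning), so the sampler learns to keep the layers that recover best after fine-tuning rather than those with the smallest immediate loss.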

Gongfan Fang, Kunjun Li, Xinyin Ma, Xinchao Wang • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Class-conditional Image Generation | ImageNet 256x256 (train) | IS: 251 | 305 |
| Text-to-Image Generation | GenEval | GenEval Score: 73.9 | 277 |
| Image Generation | ImageNet (val) | FID: 2.28 | 198 |
| Text-to-Image Generation | DPG | Overall Score: 80.7 | 131 |
| Text-to-Image Generation | DPG-Bench | DPG Score: 80.7 | 89 |
| Text-to-Image Generation | OneIG-Bench | -- | 33 |
| Text-to-Image Generation | GenEval | GenEval Score: 73.9 | 16 |
| Text-to-Image Generation | T2I-CompBench | B-VQA Score: 68.9 | 16 |
| Long-text-to-Image Generation | LongText-Bench | EN Score: 85.9 | 15 |
| Text-to-Image Generation | T2I-CompBench | B-VQA: 68.9 | 6 |
