Accelerating Diffusion Transformers with Token-wise Feature Caching

About

Diffusion transformers have shown significant effectiveness in both image and video synthesis at the expense of huge computation costs. To address this problem, feature caching methods have been introduced to accelerate diffusion transformers by caching the features in previous timesteps and reusing them in the following timesteps. However, previous caching methods ignore that different tokens exhibit different sensitivities to feature caching, and feature caching on some tokens may lead to 10$\times$ more destruction to the overall generation quality compared with other tokens. In this paper, we introduce token-wise feature caching, allowing us to adaptively select the most suitable tokens for caching, and further enable us to apply different caching ratios to neural layers in different types and depths. Extensive experiments on PixArt-$\alpha$, OpenSora, and DiT demonstrate our effectiveness in both image and video generation with no requirements for training. For instance, 2.36$\times$ and 1.93$\times$ acceleration are achieved on OpenSora and PixArt-$\alpha$ with almost no drop in generation quality.

Chang Zou, Xuyang Liu, Ting Liu, Siteng Huang, Linfeng Zhang• 2024

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	MJHQ-30K	Overall FID11.8	239
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)48.8	235
Text-to-Image Generation	MS-COCO (val)	FID8.84	202
Class-conditional Image Generation	ImageNet	FID2.58	174
Class-conditional Image Generation	ImageNet (val)	IS246.6	116
Text-to-Image Generation	Qwen-Image	--	96
Text-to-Image Generation	PartiPrompts	ImageReward1.09	92
Class-conditional Image Generation	ImageNet-1k (val)	FID3.04	79
Text-to-Image Generation	MS-COCO (30K)	FID (30K)22.05	72
Text-to-Image Generation	ImageReward	ImageReward Score1.202	69

Showing 10 of 41 rows

Other info

Follow for update

@wizwand_team Discord