Token Caching for Diffusion Transformer Acceleration

About

Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their computational demands, particularly the quadratic complexity of attention mechanisms and multi-step inference processes, present substantial bottlenecks that limit their practical applications. To address these challenges, we propose TokenCache, a novel acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations. TokenCache tackles three critical questions: (1) Which tokens should be pruned and reused by the caching mechanism to eliminate redundancy? (2) Which blocks should be targeted for efficient caching? (3) At which time steps should caching be applied to balance speed and quality? In response to these challenges, TokenCache introduces a Cache Predictor that hierarchically addresses these issues by (1) Token pruning: assigning importance scores to each token to determine which tokens to prune and reuse; (2) Block selection: allocating pruning ratio to each block to adaptively select blocks for caching; (3) Temporal Scheduling: deciding at which time steps to apply caching strategies. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers.

Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Yuming Li, Chenguang Ma• 2024

Related benchmarks

Task	Dataset	Result	Rank
Text-to-Image Generation	MS-COCO 2017 (val)	FID28.83		131
Text-to-Image Generation	PartiPrompts	CLIP Score0.3226		26

Showing 2 of 2 rows

Other info

Follow for update

@wizwand_team Discord