Our new X account is live! Follow @wizwand_team for updates
WorkDL logo mark

Token Caching for Diffusion Transformer Acceleration

About

Diffusion transformers have gained substantial interest in diffusion generative modeling due to their outstanding performance. However, their computational demands, particularly the quadratic complexity of attention mechanisms and multi-step inference processes, present substantial bottlenecks that limit their practical applications. To address these challenges, we propose TokenCache, a novel acceleration method that leverages the token-based multi-block architecture of transformers to reduce redundant computations. TokenCache tackles three critical questions: (1) Which tokens should be pruned and reused by the caching mechanism to eliminate redundancy? (2) Which blocks should be targeted for efficient caching? (3) At which time steps should caching be applied to balance speed and quality? In response to these challenges, TokenCache introduces a Cache Predictor that hierarchically addresses these issues by (1) Token pruning: assigning importance scores to each token to determine which tokens to prune and reuse; (2) Block selection: allocating pruning ratio to each block to adaptively select blocks for caching; (3) Temporal Scheduling: deciding at which time steps to apply caching strategies. Experimental results across various models demonstrate that TokenCache achieves an effective trade-off between generation quality and inference speed for diffusion transformers.

Jinming Lou, Wenyang Luo, Yufan Liu, Bing Li, Xinmiao Ding, Weiming Hu, Yuming Li, Chenguang Ma• 2024

Related benchmarks

TaskDatasetResultRank
Text-to-Image GenerationMS-COCO 2017 (val)
FID28.83
80
Text-to-Image GenerationPartiPrompts
CLIP Score0.3226
26
Showing 2 of 2 rows

Other info

Follow for update