Adaptive Caching for Faster Video Generation with Diffusion Transformers
About
Generating temporally consistent, high-fidelity videos can be computationally expensive, especially over longer temporal spans. More recent Diffusion Transformers (DiTs), despite making significant headway in this context, have only heightened such challenges, as they rely on larger models and heavier attention mechanisms, resulting in slower inference. In this paper, we introduce a training-free method to accelerate video DiTs, termed Adaptive Caching (AdaCache), which is motivated by the fact that "not all videos are created equal": some videos require fewer denoising steps than others to attain a reasonable quality. Building on this, we not only cache computations through the diffusion process, but also devise a caching schedule tailored to each video generation, optimizing the quality-latency trade-off. We further introduce a Motion Regularization (MoReg) scheme to utilize video information within AdaCache, essentially controlling the compute allocation based on motion content. Altogether, our plug-and-play contributions grant significant inference speedups (e.g., up to 4.7x on Open-Sora 720p, 2s video generation) without sacrificing generation quality, across multiple video DiT baselines.
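The abstract gives no implementation details, so the following is only a rough sketch of the general idea of caching with a content-dependent schedule, not the authors' implementation. All names here (`l1_distance`, `skip_steps_for`, `run_with_adaptive_cache`, the `schedule` thresholds, and the `motion_weight` / `motion_score` scaling) are hypothetical and chosen for illustration. The assumption is that a block's output (a "residual") is recomputed at some denoising steps and reused at others, with the number of reused steps picked from how much the residual changed since the last computation; a motion term inflates that change so that high-motion content is cached less aggressively.

```python
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def l1_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean absolute difference between two equally sized vectors."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def skip_steps_for(distance: float, schedule: Sequence[Tuple[float, int]]) -> int:
    """Map a change measure to a number of steps to reuse the cache.

    `schedule` is a hypothetical list of (threshold, steps_to_skip) pairs,
    sorted by ascending threshold: smaller change allows more skipping.
    """
    for threshold, skip in schedule:
        if distance < threshold:
            return skip
    return 0  # large change: recompute at the very next step


def run_with_adaptive_cache(
    block: Callable[[Vector, int], Vector],
    x: Vector,
    num_steps: int,
    schedule: Sequence[Tuple[float, int]],
    motion_score: float = 0.0,   # assumed per-video motion estimate
    motion_weight: float = 0.0,  # assumed regularization strength
) -> Tuple[Vector, int]:
    """Run `num_steps` updates, reusing cached residuals where allowed.

    Returns the final state and how many times `block` was actually called.
    """
    cached = None  # last residual computed by `block`
    skip_left = 0  # remaining steps that may reuse `cached`
    computed = 0
    for t in range(num_steps):
        if cached is not None and skip_left > 0:
            residual = cached  # cache hit: no block call
            skip_left -= 1
        else:
            residual = block(x, t)
            computed += 1
            if cached is not None:
                change = l1_distance(residual, cached)
                # Higher motion inflates the change, so fewer steps are skipped.
                change *= 1.0 + motion_weight * motion_score
                skip_left = skip_steps_for(change, schedule)
            cached = residual
        x = [xi + ri for xi, ri in zip(x, residual)]
    return x, computed
```

With a toy `block` whose output never changes, most steps reuse the cache; with a `block` whose output changes at every step, every step is recomputed. In this sketch the schedule is therefore determined by the content being generated rather than fixed in advance.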
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Depth Estimation | Sintel | Delta Threshold Accuracy (1.25) | 48.3 | 235 |
| Text-to-Image Generation | MS-COCO | FID | 7.82 | 145 |
| Class-conditional Image Generation | ImageNet (val) | -- | -- | 116 |
| Camera pose estimation | Sintel dataset | ATE | 0.193 | 35 |
| Text-to-Video Generation | HunyuanVideo (test) | Quality Score | 81.78 | 23 |
| Video Generation | Image and Video Generation | FID | 4.64 | 20 |
| Image-to-Video Generation | VBench 1.5 (test) | I2V Score | 91.88 | 15 |
| Image Generation | ImageNet-256 (test) | FID | 4.75 | 11 |
| Visual Planning | VBench RealEstate10K (val) | Subject Consistency | 91.94 | 7 |
| Video Generation | DiT-XL 2 | FID | 4.64 | 5 |