Less is Enough: Training-Free Video Diffusion Acceleration via Runtime-Adaptive Caching

About

Video generation models have demonstrated remarkable performance, yet their broader adoption remains constrained by slow inference and substantial computational cost, primarily due to the iterative nature of the denoising process. Addressing this bottleneck is essential for democratizing advanced video synthesis technologies and enabling their integration into real-world applications. This work proposes EasyCache, a training-free acceleration framework for video diffusion models. EasyCache introduces a lightweight, runtime-adaptive caching mechanism that dynamically reuses previously computed transformation vectors, avoiding redundant computations during inference. Unlike prior approaches, EasyCache requires no offline profiling, pre-computation, or extensive parameter tuning. We conduct comprehensive studies on various large-scale video generation models, including OpenSora, Wan2.1, and HunyuanVideo. Our method achieves leading acceleration performance, reducing inference time by 2.1-3.3× compared to the original baselines while maintaining high visual fidelity, with a PSNR improvement of up to 36% over the previous SOTA method. This makes EasyCache an efficient and highly accessible solution for high-quality video generation in both research and practical applications. The code is available at https://github.com/H-EmbodVis/EasyCache.

Xin Zhou, Dingkang Liang, Kaijin Chen, Tianrui Feng, Xiwu Chen, Hongkai Lin, Yikang Ding, Feiyang Tan, Hengshuang Zhao, Xiang Bai • 2025
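
The caching idea in the abstract (reuse a previously computed transformation vector instead of re-running the model at every denoising step) can be illustrated with a small sketch. This is not the authors' implementation. The class name `AdaptiveCache`, the `threshold` parameter, and the reuse criterion (accumulated relative input change) are assumptions made for illustration only; the actual method is in the linked repository.

```python
from typing import Callable, List, Optional

Vector = List[float]


def l1_norm(v: Vector) -> float:
    return sum(abs(x) for x in v)


class AdaptiveCache:
    """Hypothetical sketch: reuse the cached (output - input) vector while
    the accumulated relative input change stays below a threshold."""

    def __init__(self, threshold: float) -> None:
        self.threshold = threshold            # assumed tuning knob
        self.delta: Optional[Vector] = None   # cached transformation vector
        self.prev_input: Optional[Vector] = None
        self.accumulated = 0.0                # change since last full call
        self.full_calls = 0                   # number of real model calls

    def step(self, x: Vector, model: Callable[[Vector], Vector]) -> Vector:
        if self.delta is not None and self.prev_input is not None:
            diff = [a - b for a, b in zip(x, self.prev_input)]
            self.accumulated += l1_norm(diff) / (l1_norm(self.prev_input) + 1e-8)
            if self.accumulated < self.threshold:
                # Cheap path: approximate the output with the cached delta.
                self.prev_input = list(x)
                return [a + d for a, d in zip(x, self.delta)]
        # Expensive path: run the model and refresh the cache.
        out = model(x)
        self.full_calls += 1
        self.delta = [o - a for o, a in zip(out, x)]
        self.prev_input = list(x)
        self.accumulated = 0.0
        return out


def run_demo(threshold: float, steps: int = 10) -> int:
    """Run a toy iterative loop and return how many full model calls occurred."""
    toy_model = lambda v: [0.9 * a for a in v]  # stand-in for a denoiser
    cache = AdaptiveCache(threshold)
    x = [1.0, 2.0]
    for _ in range(steps):
        x = cache.step(x, toy_model)
    return cache.full_calls
```

With a threshold of zero the sketch never reuses the cache and calls the model at every step; a larger threshold trades accuracy for fewer full calls. Because the decision is made from quantities observed at runtime, a scheme of this shape needs no offline profiling, which matches the property the abstract emphasizes.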

Related benchmarks

Task	Dataset	Result
Text-to-Image Generation	Qwen-Image	Image Reward 1.282	96
Video Generation	Wan 1.3B (81 frames, 832×480) 2.1	VBench Score 80.49	21
Image Generation	FLUX.1 (dev)	Image Reward 0.986	20
Image2World generation	PAI-Bench	Domain Score Average 0.8399	17
Video Generation	HunyuanVideo 129 frames 544P	VBench Score 82.04	15
Text2World (T2W) Generation	PAI-Bench	Latency (s) 41.41	14
Text-to-Video Generation	VBench 1.0 (test)	VBench Score 81.51	13
Video Generation	Open-Sora 51 frames 848 x 480 1.2	Latency (s) 34.55	11
Image-to-World generation	HunyuanVoyager-13B (test)	WorldScore Static 64.16	9
Image-to-World generation	Aether-5B (test)	WorldScore Static 62.89	9

Showing 10 of 17 rows
