From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers

About

Diffusion Transformers (DiT) have revolutionized high-fidelity image and video synthesis, yet their computational demands remain prohibitive for real-time applications. To address this, feature caching has been proposed to accelerate diffusion models by caching the features computed at previous timesteps and reusing them at the following timesteps. However, when timesteps are separated by large intervals, feature similarity in diffusion models decreases substantially, so the error introduced by feature caching grows sharply and significantly harms generation quality. To solve this problem, we propose TaylorSeer, which is the first to show that the features of diffusion models at future timesteps can be predicted from their values at previous timesteps. Based on the fact that features change slowly and continuously across timesteps, TaylorSeer uses finite differences to approximate the higher-order derivatives of the features and predicts the features at future timesteps with a Taylor series expansion. Extensive experiments demonstrate its effectiveness in both image and video synthesis, especially at high acceleration ratios. For instance, it achieves almost lossless acceleration of 4.99× on FLUX and 5.00× on HunyuanVideo without additional training. On DiT, it achieves an FID 3.41 lower than the previous SOTA at 4.53× acceleration. Our code has been released on GitHub: https://github.com/Shenyi-Z/TaylorSeer
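The forecasting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `history` list of cached features, and the assumption that full computations are spaced a fixed `step` apart are all hypothetical choices made for illustration. Derivatives are estimated with backward finite differences over the cached features and then plugged into a truncated Taylor expansion; order 0 reduces to plain feature reuse.

```python
import math
import numpy as np


def finite_difference_derivatives(history, step, order):
    """Estimate [F, F', F'', ...] at the most recent cached point.

    history: feature arrays from fully computed timesteps, oldest first,
             assumed to be spaced `step` timesteps apart (an assumption).
    """
    order = min(order, len(history) - 1)  # cannot exceed available history
    level = [np.asarray(h, dtype=float) for h in history[-(order + 1):]]
    derivs = [level[-1]]                  # zeroth order: the cached feature
    for _ in range(order):
        # Backward differences of the previous level, divided by the spacing.
        level = [(b - a) / step for a, b in zip(level[:-1], level[1:])]
        derivs.append(level[-1])
    return derivs


def taylor_forecast(history, step, k, order=2):
    """Predict the feature k timesteps after the latest cached point."""
    derivs = finite_difference_derivatives(history, step, order)
    return sum(d * (k ** i) / math.factorial(i) for i, d in enumerate(derivs))


# Toy usage: two features that change linearly, cached every 3 timesteps.
history = [np.array([1.0, 2.0]), np.array([7.0, 5.0]), np.array([13.0, 8.0])]
reused = taylor_forecast(history, step=3, k=2, order=0)      # plain reuse
forecast = taylor_forecast(history, step=3, k=2, order=2)    # Taylor forecast
print(reused, forecast)
```

In this toy case the features are exactly linear, so the forecast recovers the true future value while plain reuse lags behind by two timesteps of change. For real diffusion features the forecast is only an approximation whose error depends on the interval and the expansion order.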

Jiacheng Liu, Chang Zou, Yuanhuiyi Lyu, Junjie Chen, Linfeng Zhang • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Text-to-Image Generation	MJHQ-30K	Overall FID	24.36	239
Video Depth Estimation	Sintel	Delta Threshold Accuracy (1.25)	46.6	235
Image Generation	ImageNet 512x512 (val)	FID-50K	3.51	219
Text-to-Image Generation	MS-COCO (val)	FID	10.08	202
Class-conditional Image Generation	ImageNet	FID	2.55	174
Class-conditional Image Generation	ImageNet (val)	IS	223.8	116
Text-to-Image Generation	Qwen-Image	ImageReward	1.18	96
Text-to-Image Generation	PartiPrompts	ImageReward	0.9813	92
Text-to-Image Generation	COCO	FID	34.74	79
Text-to-Image Generation	MS-COCO (30K)	FID (30K)	29.66	72

Showing 10 of 72 rows
