Causal Forcing: Autoregressive Diffusion Distillation Done Right for High-Quality Real-Time Interactive Video Generation

About

To achieve real-time interactive video generation, current methods distill pretrained bidirectional video diffusion models into few-step autoregressive (AR) models, which creates an architectural gap when full attention is replaced by causal attention. However, existing approaches do not bridge this gap theoretically. They initialize the AR student via ODE distillation, which requires frame-level injectivity: each noisy frame must map to a unique clean frame under the PF-ODE of an AR teacher. Distilling an AR student from a bidirectional teacher violates this condition, which prevents recovery of the teacher's flow map and instead induces a conditional-expectation solution that degrades performance. To address this issue, we propose Causal Forcing, which uses an autoregressive teacher for ODE initialization to bridge the architectural gap, and then applies the same DMD procedure as in Self Forcing. Empirical results show that our method outperforms all baselines across all metrics, surpassing the SOTA Self Forcing by 19.3% in Dynamic Degree, 8.7% in VisionReward, and 16.7% in Instruction Following. Project page: https://thu-ml.github.io/CausalForcing.github.io/; code: https://github.com/thu-ml/Causal-Forcing.
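The injectivity argument in the abstract can be illustrated with a toy regression. The sketch below is not the paper's method or code: it replaces video frames with scalars and the student network with a hypothetical lookup-table regressor (`fit_mse_lookup` is a name made up for this example). It only shows the general fact that a squared-error fit returns the mean of the targets seen for each input, so when one noisy input is paired with several clean targets the fit collapses to their average rather than recovering any single mapping.

```python
import numpy as np


def fit_mse_lookup(inputs, targets):
    """Return the squared-error-optimal prediction for each distinct input.

    For a lookup-table regressor, the minimizer of the mean squared error
    at a given input is the mean of the targets observed at that input.
    """
    table = {}
    for x in np.unique(inputs):
        table[float(x)] = float(targets[inputs == x].mean())
    return table


# Injective toy data (assumption): each "noisy" input has one "clean" target.
inj_inputs = np.array([0.0, 0.0, 1.0, 1.0])
inj_targets = np.array([-1.0, -1.0, 1.0, 1.0])

# Non-injective toy data (assumption): each input is paired with two targets.
amb_inputs = np.array([0.0, 0.0, 1.0, 1.0])
amb_targets = np.array([-1.0, 1.0, -1.0, 1.0])

injective_fit = fit_mse_lookup(inj_inputs, inj_targets)
ambiguous_fit = fit_mse_lookup(amb_inputs, amb_targets)

print("injective:", injective_fit)  # the one-to-one mapping is recovered
print("ambiguous:", ambiguous_fit)  # predictions collapse to the target mean
```

In the injective case the fit reproduces the mapping exactly, while in the ambiguous case every prediction is 0.0, a value that never appears among the targets. This is the scalar analogue of the conditional-expectation solution described above.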

Hongzhou Zhu, Min Zhao, Guande He, Hang Su, Chongxuan Li, Jun Zhu • 2026

Related benchmarks

Task	Dataset	Result
Long Video Generation	VBench-Long 60 seconds	Subject Consistency 94.62	74
Video Generation	VBench 5s	Quality Score 85.41	73
Video Generation	VBench (test)	Semantic Score 70.97	66
Video Generation	VBench Long	Motion Smoothness 97.67	49
Video Generation	short videos 81-frames 240 prompts	Total Score 5.4	38
Text-to-Video Generation	VBench (test)	Total Score 78.39	37
Long Video Generation	VBench	Overall Score 84.04	35
Video Generation	VBench 2.0	Human Fidelity 0.886	26
Video Generation	VideoAlign	VQ Score 3.97	26
Long Video Generation	VBenchLong 30-second	Dynamic Degree 97.14	22

Showing 10 of 28 rows
