Self Forcing: Bridging the Train-Test Gap in Autoregressive Video Diffusion

About

We introduce Self Forcing, a novel training paradigm for autoregressive video diffusion models. It addresses the longstanding issue of exposure bias, where models trained on ground-truth context must generate sequences conditioned on their own imperfect outputs during inference. Unlike prior methods that denoise future frames based on ground-truth context frames, Self Forcing conditions each frame's generation on previously self-generated outputs by performing autoregressive rollout with key-value (KV) caching during training. This strategy enables supervision through a holistic loss at the video level that directly evaluates the quality of the entire generated sequence, rather than relying solely on traditional frame-wise objectives. To ensure training efficiency, we employ a few-step diffusion model along with a stochastic gradient truncation strategy, effectively balancing computational cost and performance. We further introduce a rolling KV cache mechanism that enables efficient autoregressive video extrapolation. Extensive experiments demonstrate that our approach achieves real-time streaming video generation with sub-second latency on a single GPU, while matching or even surpassing the generation quality of significantly slower and non-causal diffusion models. Project website: http://self-forcing.github.io/
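The rollout and rolling KV cache described above can be pictured with a small toy sketch. This is not the authors' implementation: `RollingKVCache`, `rollout`, and `toy_generate_frame` are hypothetical names, the "model" is a stand-in that only records how much context it saw, and evicting the oldest frame once the window is full is an assumed policy chosen for illustration.

```python
from collections import deque


class RollingKVCache:
    """Fixed-capacity cache of per-frame (key, value) entries.

    Hypothetical illustration: once the cache is full, the oldest frame's
    entry is evicted so memory stays constant during long rollouts.
    """

    def __init__(self, max_frames):
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        # deque(maxlen=...) drops the oldest item automatically on append.
        self._entries = deque(maxlen=max_frames)

    def append(self, key, value):
        self._entries.append((key, value))

    def context(self):
        """Return the cached (key, value) pairs, oldest first."""
        return list(self._entries)

    def __len__(self):
        return len(self._entries)


def rollout(generate_frame, num_frames, max_cached_frames):
    """Generate frames one at a time, each conditioned on cached outputs.

    `generate_frame(index, context)` stands in for a few-step denoiser and
    must return (frame, key, value). The context holds only entries from
    previously self-generated frames, never ground-truth frames.
    """
    cache = RollingKVCache(max_cached_frames)
    frames = []
    for index in range(num_frames):
        frame, key, value = generate_frame(index, cache.context())
        frames.append(frame)
        cache.append(key, value)
    return frames, cache


def toy_generate_frame(index, context):
    # Toy "model": the frame records how many cached frames it attended to.
    frame = {"index": index, "context_len": len(context)}
    return frame, f"k{index}", f"v{index}"
```

Calling `rollout(toy_generate_frame, 5, 3)` yields frames whose context lengths grow as 0, 1, 2, 3, 3: the window fills and then stays at three entries while generation continues. In a real model the keys and values would be attention tensors rather than strings.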

Xun Huang, Zhengqi Li, Guande He, Mingyuan Zhou, Eli Shechtman • 2025

Related benchmarks

Task | Dataset | Result | Rank
Text-to-Video Generation | VBench | Quality Score 85.25 | 111
Video Generation | VBench | -- | 102
Video Generation | VBench 5s | Total Score 84.31 | 35
Video Generation | VBench short video (test) | Subject Consistency 80.14 | 16
Video Generation | VBench Overall | Throughput (FPS) 17 | 11
Short Video Generation | VBench 2024 | Total Score 84.31 | 11
Short Video Generation | VBench official prompts | Total Score 83.8 | 11
Video Generation | VBench 30-second generation | Imaging Quality 83.82 | 11
Video Generation | Single-prompt 5-second setting | Total Score 84.31 | 11
Video Generation | MAG-Bench | PSNR 15.65 | 10
Showing 10 of 46 rows
