Rolling Forcing: Autoregressive Long Video Diffusion in Real Time

About

Streaming video generation, a fundamental component of interactive world models and neural game engines, aims to generate high-quality, low-latency, and temporally coherent long video streams. However, most existing work suffers from severe error accumulation, which often significantly degrades the generated video streams over long horizons. We design Rolling Forcing, a novel video generation technique that enables streaming long videos with minimal error accumulation. Rolling Forcing comes with three novel designs. First, instead of iteratively sampling individual frames, which accelerates error propagation, we design a joint denoising scheme that simultaneously denoises multiple frames with progressively increasing noise levels. This design relaxes the strict causality across adjacent frames, effectively suppressing error growth. Second, we introduce the attention sink mechanism into long-horizon streaming video generation, which allows the model to keep the key-value states of the initial frames as a global context anchor and thereby enhances long-term global consistency. Third, we design an efficient training algorithm that enables few-step distillation over greatly extended denoising windows. This algorithm operates on non-overlapping windows and mitigates exposure bias conditioned on self-generated histories. Extensive experiments show that Rolling Forcing enables real-time streaming generation of multi-minute videos on a single GPU, with substantially reduced error accumulation.
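
The first two designs can be illustrated with a toy simulation. This is only a sketch of the general idea, not the authors' implementation: the linear noise schedule, the warm-started window, the window and cache sizes, and every function name below (`rolling_noise_levels`, `sink_cache_indices`, `stream`) are assumptions made for illustration, and the actual denoising network is replaced by a counter.

```python
from collections import deque


def rolling_noise_levels(window: int) -> list[float]:
    """Noise level per window slot (assumed linear schedule).

    Slot 0 is the least noisy frame; the last slot is pure noise.
    """
    return [(i + 1) / window for i in range(window)]


def sink_cache_indices(num_emitted: int, num_sink: int, num_recent: int) -> list[int]:
    """Indices of already-emitted frames whose key/value states stay cached.

    Assumed policy: always keep the first `num_sink` frames (the attention
    sink) plus the `num_recent` most recent frames.
    """
    sink = list(range(min(num_sink, num_emitted)))
    recent_start = max(num_sink, num_emitted - num_recent)
    return sink + list(range(recent_start, num_emitted))


def stream(num_frames: int, window: int) -> list[int]:
    """Return frame ids in the order they leave the rolling window."""
    # Warm start (an assumption): slot i begins at integer noise level i + 1,
    # measured in units of 1 / window.
    slots = deque([frame_id, frame_id + 1] for frame_id in range(window))
    next_id = window
    emitted = []
    while len(emitted) < num_frames:
        # Joint step: every frame in the window moves one level toward clean.
        # A real model would run one denoising pass over the whole window here.
        for slot in slots:
            slot[1] -= 1
        # The front frame has reached level 0, so it is emitted.
        frame_id, level = slots.popleft()
        assert level == 0
        emitted.append(frame_id)
        # A fresh frame at the highest noise level enters at the back.
        slots.append([next_id, window])
        next_id += 1
    return emitted


print(rolling_noise_levels(4))
print(stream(5, 3))
print(sink_cache_indices(10, 2, 3))
```

In this toy version each joint step emits exactly one clean frame while the remaining frames stay partially denoised, and the cache always retains the earliest frames regardless of how long the stream runs.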

Kunhao Liu, Wenbo Hu, Jiale Xu, Ying Shan, Shijian Lu • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Video Generation | VBench | -- | 126 |
| Video Generation | VBench 5s | Total Score 83.32 | 58 |
| Video Generation | short videos 81-frames 240 prompts | Total Score 5.25 | 38 |
| Video Generation | VBench Long | Semantic Score 78.03 | 23 |
| Long Video Generation | 120, 240, 720 and 1440-frames long videos | Total Score 6.86 | 20 |
| Video Generation | VBench short video (test) | Subject Consistency 69.78 | 16 |
| Short Video Generation | VBench-Long 60 seconds | Aesthetic Quality 58.34 | 13 |
| Long Video Generation | VBench-Long 60 seconds | Subject Consistency 97.94 | 12 |
| Video Generation | VBench Short-Duration extended prompt suite | Total Score 82.95 | 12 |
| Short Video Generation | VBench 2024 | Total Score 81.22 | 11 |
Showing 10 of 35 rows
