Long-Context Autoregressive Video Modeling with Next-Frame Prediction

About

Long-context video modeling is essential for enabling generative models to function as world simulators, because such models must maintain temporal coherence over extended time spans. However, most existing models are trained on short clips, which limits their ability to capture long-range dependencies, even with test-time extrapolation. While training directly on long videos is a natural solution, the rapid growth in the number of vision tokens makes it computationally prohibitive. To support the exploration of efficient long-context video modeling, we first establish a strong autoregressive baseline called Frame AutoRegressive (FAR). FAR models temporal dependencies between continuous frames, converges faster than video diffusion transformers, and outperforms token-level autoregressive models. Building on this baseline, we observe context redundancy in video autoregression: nearby frames are critical for maintaining temporal consistency, whereas distant frames primarily serve as context memory. To eliminate this redundancy, we propose long short-term context modeling with asymmetric patchify kernels, which apply large kernels to distant frames to reduce redundant tokens and standard kernels to local frames to preserve fine-grained detail. This significantly reduces the cost of training on long videos. Our method achieves state-of-the-art results on both short and long video generation, providing an effective baseline for long-context autoregressive video modeling.
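
The token saving from asymmetric patchify kernels can be illustrated with a small back-of-the-envelope calculation. The sketch below is not the authors' implementation: the function names, the 64x64 frame size, the patch sizes (2 for local frames, 4 for distant frames), and the local window of 8 frames are all assumptions chosen only to make the arithmetic concrete.

```python
def tokens_per_frame(height, width, patch):
    """Number of tokens when a frame is cut into non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("frame size must be divisible by the patch size")
    return (height // patch) * (width // patch)


def context_tokens(num_frames, local_window, height, width, short_patch, long_patch):
    """Total context tokens when the most recent `local_window` frames use the
    standard (small) patch and all earlier frames use the large patch.

    All parameter names here are hypothetical and serve illustration only.
    """
    local = min(num_frames, local_window)   # nearby frames keep fine detail
    distant = num_frames - local            # distant frames act as coarse memory
    return (local * tokens_per_frame(height, width, short_patch)
            + distant * tokens_per_frame(height, width, long_patch))


# Assumed setting: 32 context frames of 64x64, patch 2 (local) vs. patch 4 (distant).
uniform = context_tokens(32, 32, 64, 64, short_patch=2, long_patch=4)
asymmetric = context_tokens(32, 8, 64, 64, short_patch=2, long_patch=4)
print(uniform, asymmetric)
```

Under these assumed numbers, doubling the patch side for distant frames cuts their token count by a factor of four, so the total context shrinks even though the local frames are left untouched. Because attention cost grows faster than linearly in sequence length, the saving in compute would be larger than the saving in token count.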

Yuchao Gu, Weijia Mao, Mike Zheng Shou • 2025

Related benchmarks

Task | Dataset | Result | Rank
Class-Conditional Video Generation | UCF101 | -- | 19
Class-to-video generation | UCF-101 | FVD 57 | 13
Video Prediction | BAIR 64x64 (test) | SSIM 0.849 | 12
Long-Context Video Prediction | DMLab 64x64 | FVD 54 | 12
Video Generation | UCF-101 64 x 64 (test) | FVD 194.1 | 12
Time Series Forecasting | GreenEarthNet 1.0 (test) | PSNR (NDVI) 17.53 | 9
Unconditional video generation | UCF-101 | FVD (2048 Dim) 279 | 7
Time Series Forecasting | TS-S12 S2-Sentinel-2 (full-band) | PSNR 16.23 | 7
Long-Context Video Prediction | Minecraft 128x128 (test) | SSIM 0.448 | 6
Video Generation | TECO–Minecraft 128x128 | LPIPS 0.251 | 6
Showing 10 of 11 rows
