VideoSSM: Autoregressive Long Video Generation with Hybrid State-Space Memory

About

Autoregressive (AR) diffusion enables streaming, interactive long-video generation by producing frames causally, yet maintaining coherence over minute-scale horizons remains challenging due to accumulated errors, motion drift, and content repetition. We approach this problem from a memory perspective, treating video synthesis as a recurrent dynamical process that requires coordinated short- and long-term context. We propose VideoSSM, a Long Video Model that unifies AR diffusion with a hybrid state-space memory. The state-space model (SSM) serves as an evolving global memory of scene dynamics across the entire sequence, while a context window provides local memory for motion cues and fine details. This hybrid design preserves global consistency without frozen, repetitive patterns, supports prompt-adaptive interaction, and scales in linear time with sequence length. Experiments on short- and long-range benchmarks demonstrate state-of-the-art temporal consistency and motion stability among autoregressive video generators, especially at minute-scale horizons. The model also enables content diversity and interactive prompt-based control, establishing a scalable, memory-aware framework for long video generation.
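
To make the hybrid-memory idea concrete, here is a minimal toy sketch. It is not the VideoSSM architecture: the class `HybridMemory`, the scalar `decay` recurrence, the window size, and the placeholder "generator" in `generate` are all assumptions introduced for illustration. It only shows the general pattern the abstract describes: a fixed-size recurrent state summarizes the whole history, a sliding window keeps the most recent frames, and each step does a constant amount of work, so total cost grows linearly with the number of frames.

```python
from collections import deque


class HybridMemory:
    """Toy hybrid memory: a decaying recurrent state plus a sliding window.

    Hypothetical illustration only; not the model from the paper.
    """

    def __init__(self, dim, window=4, decay=0.9):
        self.decay = decay
        self.state = [0.0] * dim            # global memory, fixed size
        self.window = deque(maxlen=window)  # local memory, most recent frames

    def update(self, frame):
        # Linear recurrence: h_t = decay * h_{t-1} + (1 - decay) * x_t
        self.state = [
            self.decay * h + (1.0 - self.decay) * x
            for h, x in zip(self.state, frame)
        ]
        # The deque silently drops the oldest frame once it is full.
        self.window.append(list(frame))

    def context(self):
        # What a causal generator would condition on for the next frame.
        return list(self.state), list(self.window)


def generate(num_frames, dim=3):
    """Produce frames one at a time, conditioning only on the memory."""
    memory = HybridMemory(dim)
    frames = []
    for _ in range(num_frames):
        global_state, local_frames = memory.context()
        last = local_frames[-1] if local_frames else [0.0] * dim
        # Placeholder generator (assumption): blend global and local
        # context and add a small constant drift.
        frame = [0.5 * g + 0.5 * l + 0.1 for g, l in zip(global_state, last)]
        frames.append(frame)
        memory.update(frame)
    return frames, memory


frames, memory = generate(10)
```

The memory footprint stays constant no matter how many frames are generated: the state has `dim` entries and the window never holds more than `window` frames.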

Yifei Yu, Xiaoshan Wu, Xinting Hu, Tao Hu, Yangtian Sun, Xiaoyang Lyu, Bo Wang, Lin Ma, Yuewen Ma, Zhongrui Wang, Xiaojuan Qi • 2025

Related benchmarks

Task	Dataset	Metric	Value
Video Generation	VBench short video (test)	Subject Consistency	80.22	16
Action Accuracy	MIND third-person view (test)	Trans. Error	0.0201	6
Scene Reconstruction	MIND first-person view (test)	MSE	0.0796	6
Scene Reconstruction	MIND third-person view (test)	MSE	0.0928	6
Action Accuracy	MIND first-person view (test)	Translation Error	3.79	6
3D Geometry Consistency	MIND first-person view (test)	Reprojection Error	53.95	4
3D Geometry Consistency	MIND third-person view (test)	Reprojection Error	61.26	4
Video Generation	60s Video Generation User Study 1.0 (test)	Rank 1 (%)	41.07	4
Video Generation	VBench long-video (60s)	Temporal Flickering	97.7	3

