Context as Memory: Scene-Consistent Interactive Long Video Generation with Memory Retrieval

About

Recent advances in interactive video generation have shown promising results, yet existing approaches struggle to keep scenes consistent over long videos because they make limited use of historical context. In this work, we propose Context-as-Memory, which uses historical context as memory for video generation. It includes two simple yet effective designs: (1) storing context in frame format without additional post-processing; (2) conditioning by concatenating the context and the frames to be predicted along the frame dimension at the input, requiring no external control modules. Furthermore, because incorporating all historical context incurs enormous computational overhead, we propose the Memory Retrieval module, which selects truly relevant context frames by determining field-of-view (FOV) overlap between camera poses; this significantly reduces the number of candidate frames without substantial information loss. Experiments demonstrate that Context-as-Memory achieves superior memory capabilities in interactive long video generation compared to state-of-the-art methods, and even generalizes effectively to open-domain scenarios not seen during training. The project page is at https://context-as-memory.github.io/.
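The FOV-overlap retrieval described above can be illustrated with a small sketch. This is not the authors' implementation: it assumes a simplified top-down 2D camera (position plus yaw), a shared horizontal FOV, and a finite viewing range, and it scores overlap by sampling points inside the query camera's view sector. The names `Pose`, `sees`, `fov_overlap`, and `retrieve`, and all default parameter values, are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Pose:
    """Hypothetical 2D camera pose: position (x, y) and viewing direction yaw (radians)."""
    x: float
    y: float
    yaw: float


def sees(pose, px, py, fov, max_range):
    """Return True if point (px, py) lies inside the camera's view sector."""
    dx, dy = px - pose.x, py - pose.y
    dist = math.hypot(dx, dy)
    if dist == 0.0:
        return True
    if dist > max_range:
        return False
    # Angle between the viewing direction and the point, wrapped to [-pi, pi].
    diff = math.atan2(dy, dx) - pose.yaw
    diff = math.atan2(math.sin(diff), math.cos(diff))
    return abs(diff) <= fov / 2


def fov_overlap(query, cand, fov=math.radians(60), max_range=10.0, n_ang=15, n_rad=10):
    """Fraction of sampled points in the query's view sector that the candidate also sees."""
    hits = 0
    total = 0
    for i in range(n_ang):
        ang = query.yaw - fov / 2 + fov * (i + 0.5) / n_ang
        for j in range(n_rad):
            r = max_range * (j + 0.5) / n_rad
            px = query.x + r * math.cos(ang)
            py = query.y + r * math.sin(ang)
            total += 1
            hits += sees(cand, px, py, fov, max_range)
    return hits / total


def retrieve(query, history, k=2, **kwargs):
    """Return indices of up to k history poses with non-zero overlap, best first."""
    scored = [(fov_overlap(query, p, **kwargs), i) for i, p in enumerate(history)]
    scored = [s for s in scored if s[0] > 0]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


# Usage: three past camera poses and one query pose at the origin looking along +x.
query = Pose(0.0, 0.0, 0.0)
history = [
    Pose(0.0, 0.0, 0.0),      # identical view
    Pose(0.0, 0.0, math.pi),  # same position, looking the opposite way
    Pose(1.0, 0.0, 0.1),      # slightly ahead, slightly rotated
]
selected = retrieve(query, history, k=2)
print(selected)
```

In this toy setup the opposite-facing pose is filtered out because it shares no visible region with the query. The frames at the selected indices would then be the context concatenated with the frames to be predicted along the frame dimension.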

Jiwen Yu, Jianhong Bai, Yiran Qin, Quande Liu, Xintao Wang, Pengfei Wan, Di Zhang, Xihui Liu • 2025

Related benchmarks

Task	Dataset	Result
Video Generation	VBench	--	126
Motion Dynamics Modeling	MosaicMem Motion Dynamics (Dedicated Evaluation Set)	Dynamic Score: 1.72	12
Camera Motion Control	MosaicMem Camera Control (Dedicated Evaluation Set)	Rotational Error: 4.65	12
Memory Retrieval Consistency	MosaicMem Memory Retrieval (test)	SSIM: 49	12
Video Generation	MosaicMem Dedicated Evaluation Set Overall Generation	FID: 85.32	12
Video Generation	RealEstate10K and DL3DV partial-revisit (evaluation)	Total Quality Score: 78.07	11
Long Video Generation	DL3DV-Evaluation (test)	SSIM: 0.37	8
Long Video Generation	Tanks&Temples (test)	SSIM: 36.7	8
Video Generation	RealEstate10K Long-duration (test)	RotErr (°): 10.87	8
3D Scene Generation	DL3DV (test)	LPIPS (P): 0.433	7

Showing 10 of 37 rows
