Spatia: Video Generation with Updatable Spatial Memory

About

Existing video generation models struggle to maintain long-term spatial and temporal consistency due to the dense, high-dimensional nature of video signals. To overcome this limitation, we propose Spatia, a spatial memory-aware video generation framework that explicitly preserves a 3D scene point cloud as persistent spatial memory. Spatia iteratively generates video clips conditioned on this spatial memory and continuously updates it through visual SLAM. This dynamic-static disentanglement design enhances spatial consistency throughout the generation process while preserving the model's ability to produce realistic dynamic entities. Furthermore, Spatia enables applications such as explicit camera control and 3D-aware interactive editing, providing a geometrically grounded framework for scalable, memory-driven video generation.
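The generate-then-update loop described above can be sketched in a few lines. This is only an illustrative toy under stated assumptions, not the authors' implementation: `SpatialMemory`, `generate_clip`, `estimate_points`, and `generate_video` are hypothetical names, the "generator" and "visual SLAM" steps are trivial stand-ins, and camera poses are reduced to plain numbers. It shows only the control flow: each clip is conditioned on the current point cloud, and the point cloud is extended after every clip.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Point = Tuple[float, float, float]


@dataclass
class SpatialMemory:
    """Hypothetical stand-in for the persistent static-scene point cloud."""
    points: Set[Point] = field(default_factory=set)

    def update(self, new_points: List[Point]) -> None:
        # Merge newly estimated points; duplicates collapse in the set.
        self.points |= set(new_points)


def generate_clip(memory: SpatialMemory, poses: List[float]) -> List[Dict]:
    """Placeholder for the video generator (assumption): one frame per camera
    pose, recording how many memory points were available as conditioning."""
    return [{"pose": p, "visible_points": len(memory.points)} for p in poses]


def estimate_points(clip: List[Dict]) -> List[Point]:
    """Placeholder for visual SLAM (assumption): one static 3D point per frame."""
    return [(float(frame["pose"]), 0.0, 0.0) for frame in clip]


def generate_video(camera_path: List[float], clip_len: int = 4):
    """Generate clip by clip, updating the spatial memory after each clip."""
    memory = SpatialMemory()
    video: List[Dict] = []
    for start in range(0, len(camera_path), clip_len):
        clip = generate_clip(memory, camera_path[start:start + clip_len])
        memory.update(estimate_points(clip))  # memory grows as the scene is explored
        video.extend(clip)
    return video, memory
```

In this toy, revisiting an already seen pose adds no new points, which loosely mirrors the idea that the static scene is stored once and reused as conditioning for later clips.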

Jinjing Zhao, Fangyun Wei, Zhening Liu, Hongyang Zhang, Chang Xu, Yan Lu • 2025

Related benchmarks

Task	Dataset	Metric	Value	
Video Generation	WorldScore (test)	Average Score	69.73	27
Camera-controlled Video Generation	RealEstate10K	FVD	163.9	14
Single-event Scene Revisit (Same Pose)	LiveBench	PSNR (Background)	20.132	8
Single-event Scene Revisit (Different Pose)	LiveBench	DINO Feature Similarity (FG)	0.392	8
Image-to-Video Generation	RealEstate 122 (test)	PSNR	18.58	6
Spatial Memory Consistency	WorldScore Subset	PSNRC	19.38	4
