SS4D: Native 4D Generative Model via Structured Spacetime Latents

About

We present SS4D, a native 4D generative model that synthesizes dynamic 3D objects directly from monocular video. Unlike prior approaches that construct 4D representations by optimizing over 3D or video generative models, we train a generator directly on 4D data, achieving high fidelity, temporal coherence, and structural consistency. At the core of our method is a compressed set of structured spacetime latents. Specifically, (1) To address the scarcity of 4D training data, we build on a pre-trained single-image-to-3D model, preserving strong spatial consistency. (2) Temporal consistency is enforced by introducing dedicated temporal layers that reason across frames. (3) To support efficient training and inference over long video sequences, we compress the latent sequence along the temporal axis using factorized 4D convolutions and temporal downsampling blocks. In addition, we employ a carefully designed training strategy to enhance robustness against occlusion.
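
As a rough illustration of item (3), the sketch below shows one way a factorized spacetime convolution with temporal downsampling could work: a spatial kernel is applied to every frame independently, then a strided kernel is applied along time at each spatial location. This is a toy example and not the authors' implementation. The function names (`spatial_conv`, `temporal_conv`, `factorized_block`), the single spatial axis standing in for the 3D latent grid, and the kernel values are all assumptions made for illustration.

```python
from typing import List

Frame = List[float]      # one spatial axis stands in for the 3D latent grid (assumption)
Sequence = List[Frame]   # indexed as [time][space]


def spatial_conv(seq: Sequence, kernel: List[float]) -> Sequence:
    """Apply the same zero-padded spatial kernel to each frame independently.

    Uses cross-correlation (no kernel flip), as is conventional in deep learning.
    """
    radius = len(kernel) // 2
    out = []
    for frame in seq:
        n = len(frame)
        row = []
        for x in range(n):
            acc = 0.0
            for k, w in enumerate(kernel):
                i = x + k - radius
                if 0 <= i < n:          # positions outside the frame count as zero
                    acc += w * frame[i]
            row.append(acc)
        out.append(row)
    return out


def temporal_conv(seq: Sequence, kernel: List[float], stride: int = 1) -> Sequence:
    """Apply a kernel along time at each spatial location (no padding).

    A stride greater than 1 shortens the sequence, i.e. downsamples time.
    """
    k = len(kernel)
    width = len(seq[0])
    out = []
    for t0 in range(0, len(seq) - k + 1, stride):
        out.append([
            sum(kernel[j] * seq[t0 + j][x] for j in range(k))
            for x in range(width)
        ])
    return out


def factorized_block(seq: Sequence, spatial_kernel: List[float],
                     temporal_kernel: List[float], stride: int) -> Sequence:
    """Spatial pass followed by a strided temporal pass."""
    return temporal_conv(spatial_conv(seq, spatial_kernel), temporal_kernel, stride)


# Eight frames, four spatial positions each; frame t holds the constant value t.
latents = [[float(t)] * 4 for t in range(8)]
# Identity spatial kernel, pairwise temporal averaging, stride 2: 8 frames become 4.
compressed = factorized_block(latents, [0.0, 1.0, 0.0], [0.5, 0.5], stride=2)
print(len(latents), "->", len(compressed))
```

Splitting the operation this way keeps the per-layer cost close to that of a spatial convolution plus a cheap 1D temporal convolution, rather than a full joint kernel over space and time; how SS4D realizes this on its structured latents is described in the paper itself.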

Zhibing Li, Mengchen Zhang, Tong Wu, Jing Tan, Jiaqi Wang, Dahua Lin • 2025

Related benchmarks

Task	Dataset	Result
4D Generation	Consistent4D	LPIPS 0.149	40
4D Mesh Reconstruction	TexVerse (test)	CD-3D 0.052	6
Video-to-4D generation	Helix4DBench 1.0 (test)	ULIP-2 0.4191	6
4D Mesh Reconstruction	ActionBench (test)	CD-3D 0.105	6
4D Generation	DAVIS 2019 (test)	Geometry Quality 4.497	5
4D Generation	ObjaverseDy (test)	LPIPS 0.15	5
4D Generation	Single Dynamic Object	Generation Time (min) 2	5
Dynamic Geometry Reconstruction	Truebone	Global CD-L2 0.0145	4
Dynamic Geometry Reconstruction	ActionBench	Global CD-L2 0.0882	4
Dynamic Geometry Reconstruction	Objaverse-XL (val)	Global CD-L2 0.0249	4

