Genie 4D: Semantic-Prior-Guided 4D Dynamic Scene Reconstruction

About

At the intersection of computer vision and robotic perception, 4D reconstruction of dynamic scenes connects low-level geometric sensing with high-level semantic understanding. We present Genie 4D, a framework that turns hand-held phone capture into a semantically grounded, action-controllable 4D world model. Genie 4D couples a real-time visual-inertial Gaussian splatting front-end for metric geometry with a feed-forward 4D backbone regularized by frozen DINOv3 features acting as structural priors. The semantic priors suppress identity drift during dynamic tracking, while a short conditional diffusion refiner recovers high-frequency surface detail that regression backbones smooth away. Finally, a lightweight latent-action head exposes the reconstructed 4D state to a Genie-style world model trained with a JEPA-style next-embedding objective, so that the scene can be rolled forward under user actions. On the Point Odyssey and TUM-Dynamics benchmarks, Genie 4D retains the linear time complexity O(T) of feed-forward baselines while improving 3D tracking accuracy (APD) and reconstruction completeness, and it runs interactively on a single consumer GPU (RTX 5090) from iPhone, Mac, Windows, and Linux capture clients. Genie 4D offers a practical, semantic-prior-guided path toward physically grounded world models.

Yiru Yang, Zhuojie Wu, Nishant Kumar Singh, Max Schulthess • 2026

Related benchmarks

Task	Dataset	Result
World Coordinate 3D Reconstruction	TUM dynamics	--	9
Reconstruction Error	TUM dynamics	Chamfer Distance (cm) 5.11	4
3D Tracking	Point Odyssey (test)	APD@0.1m 41.8	3
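
The two metrics in the table can be illustrated with a small sketch. It is not the benchmarks' official evaluation code. It assumes that Chamfer Distance is the symmetric mean nearest-neighbour distance between two point clouds, and that APD@0.1m is the percentage of tracked points whose 3D position error is below 0.1 m. The function names (`chamfer_distance`, `apd`) and the toy points are hypothetical.

```python
import math


def nearest_distance(p, cloud):
    """Euclidean distance from point p to its nearest neighbour in cloud."""
    return min(math.dist(p, q) for q in cloud)


def chamfer_distance(a, b):
    """Symmetric Chamfer distance (assumed definition): the mean
    nearest-neighbour distance, averaged over both directions."""
    a_to_b = sum(nearest_distance(p, b) for p in a) / len(a)
    b_to_a = sum(nearest_distance(q, a) for q in b) / len(b)
    return 0.5 * (a_to_b + b_to_a)


def apd(pred, gt, threshold=0.1):
    """Percentage of predicted 3D points that lie within `threshold`
    (same unit as the inputs, here metres) of their ground-truth position."""
    hits = sum(1 for p, g in zip(pred, gt) if math.dist(p, g) < threshold)
    return 100.0 * hits / len(gt)


# Toy reconstruction vs. reference cloud (coordinates in metres).
recon = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
print(chamfer_distance(recon, reference))  # 0.5

# Toy tracks: per-point errors are 0.05, 0.2, 0.05 and 0.0 metres.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.05), (2.0, 2.0, 2.0)]
gt = [(0.0, 0.0, 0.05), (1.0, 0.0, 0.2), (0.0, 1.0, 0.0), (2.0, 2.0, 2.0)]
print(apd(pred, gt, threshold=0.1))  # 75.0
```

With inputs in metres, the Chamfer value would be multiplied by 100 to report it in centimetres as in the table. The brute-force nearest-neighbour search is quadratic in the number of points, so a spatial index such as a k-d tree would be the usual choice for real point clouds.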

