GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors

About

Despite remarkable advancements in video depth estimation, existing methods are inherently limited in geometric fidelity because they produce affine-invariant predictions, which restricts their applicability in reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity, temporally coherent point map sequences from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach lies a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging the VAE, we train a video diffusion model to capture the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.

Tian-Xing Xu, Xiangjun Gao, Wenbo Hu, Xiaoyu Li, Song-Hai Zhang, Ying Shan • 2025

Related benchmarks

Task	Dataset	Result
Video Depth Estimation	Sintel	--	235
Depth Estimation	DIODE	--	92
Video Depth Estimation	ScanNet	Rel^d 7.3	29
Video pointmap evaluation	KITTI	Rel^p 0.084	24
Video Depth Estimation	KITTI	Relative Error (Rel^d) 5	23
Monocular Geometry Estimation	7 real-world evaluation datasets (DIODE, KITTI, NYUv2, ETH3D, HAMMER, iBims-1) (Average)	Rel^p 5.45	19
Video Depth Estimation	Monkaa	Relative Error (Rel^d) 13.4	18
Video Depth Estimation	UrbanSyn	Relative Error (Rel^d) 11	18
Video Depth Estimation	GMU	Relative Depth Error (Rel^d) 7.7	18
Video Depth Estimation	Unreal4K	Relative Depth Error (Rel^d) 20.7	18

Showing 10 of 42 rows
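
The Rel^d entries above are labeled as relative depth error. As a rough illustration, the sketch below computes the commonly used absolute relative error, the mean of |prediction - ground truth| / ground truth over valid pixels. This is an assumption about the metric's general form: the page does not state the exact protocol (any scale or shift alignment, masking rules, or whether values are reported as fractions or percentages), and the function name `abs_rel_error` and the sample values are hypothetical.

```python
from typing import Optional, Sequence


def abs_rel_error(
    pred: Sequence[float],
    gt: Sequence[float],
    valid: Optional[Sequence[bool]] = None,
) -> float:
    """Mean of |pred - gt| / gt over valid pixels with positive ground truth.

    Hypothetical helper: inputs are flattened depth maps of equal length.
    """
    if len(pred) != len(gt):
        raise ValueError("pred and gt must have the same length")
    mask = [True] * len(gt) if valid is None else list(valid)
    if len(mask) != len(gt):
        raise ValueError("valid mask must match the depth maps in length")

    # Skip masked-out pixels and pixels without usable ground truth (gt <= 0).
    errors = [abs(p - g) / g for p, g, m in zip(pred, gt, mask) if m and g > 0]
    if not errors:
        raise ValueError("no valid pixels to evaluate")
    return sum(errors) / len(errors)


# Toy flattened depth maps; the last ground-truth value (0.0) marks a missing pixel.
pred_depth = [1.0, 2.5, 3.0, 8.0, 9.9]
gt_depth = [1.0, 2.0, 4.0, 8.0, 0.0]

rel = abs_rel_error(pred_depth, gt_depth)
print(f"{rel:.3f}")  # prints 0.125
```

In the toy example, the four valid pixels have relative errors 0, 0.25, 0.25, and 0, so the mean is 0.125; the pixel with zero ground truth is ignored rather than counted as an error.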
