GeometryCrafter: Consistent Geometry Estimation for Open-world Videos with Diffusion Priors
About
Despite remarkable advances in video depth estimation, existing methods produce affine-invariant predictions, which inherently limits their geometric fidelity and their applicability to reconstruction and other metrically grounded downstream tasks. We propose GeometryCrafter, a novel framework that recovers high-fidelity, temporally coherent point map sequences from open-world videos, enabling accurate 3D/4D reconstruction, camera parameter estimation, and other depth-based applications. At the core of our approach is a point map Variational Autoencoder (VAE) that learns a latent space agnostic to video latent distributions for effective point map encoding and decoding. Leveraging this VAE, we train a video diffusion model to capture the distribution of point map sequences conditioned on the input videos. Extensive evaluations on diverse datasets demonstrate that GeometryCrafter achieves state-of-the-art 3D accuracy, temporal consistency, and generalization capability.
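To make the term "affine-invariant" concrete: such a depth prediction is only defined up to an unknown scale and shift, so it has to be aligned to a reference before it can be used metrically. The following is a minimal sketch of that alignment by least squares. It is an illustration only, not code from GeometryCrafter; the function name `align_scale_shift` and its arguments are hypothetical.

```python
import numpy as np

def align_scale_shift(pred, gt, mask=None):
    """Find scale s and shift t minimizing ||s * pred + t - gt||^2.

    Hypothetical helper for illustration; `mask` marks the pixels
    that have valid ground truth (all pixels if omitted).
    """
    pred = np.asarray(pred, dtype=np.float64).ravel()
    gt = np.asarray(gt, dtype=np.float64).ravel()
    if mask is None:
        mask = np.ones(gt.shape, dtype=bool)
    else:
        mask = np.asarray(mask, dtype=bool).ravel()

    # Design matrix [pred, 1] so the unknowns are (s, t).
    A = np.stack([pred[mask], np.ones(int(mask.sum()))], axis=1)
    solution, *_ = np.linalg.lstsq(A, gt[mask], rcond=None)
    s, t = solution
    return float(s), float(t)
```

Because s and t must be estimated from ground truth, an affine-invariant prediction on its own does not fix absolute geometry, which is the limitation the abstract refers to.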
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Depth Estimation | Sintel | -- | -- | 193 |
| Depth Estimation | DIODE | Relative Error (REL) | 9.1 | 63 |
| Video Depth Estimation | ScanNet | Rel^d | 7.3 | 29 |
| Video Depth Estimation | Monkaa | Relative Error (Rel^d) | 13.4 | 18 |
| Video Depth Estimation | KITTI | Relative Error (Rel^d) | 5 | 18 |
| Video Depth Estimation | UrbanSyn | Relative Error (Rel^d) | 11 | 18 |
| Video Depth Estimation | GMU | Relative Depth Error (Rel^d) | 7.7 | 18 |
| Video Depth Estimation | Unreal4K | Relative Depth Error (Rel^d) | 20.7 | 18 |
| Video pointmap evaluation | KITTI | Rel^p | 6.4 | 16 |
| Video pointmap evaluation | GMU | Rel^p | 8.4 | 16 |
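For readers unfamiliar with the relative depth and point map errors reported in the table, the sketch below shows one common way such metrics are computed. It assumes the depth error is the mean absolute relative error and the point map error is the mean Euclidean error normalized by the ground-truth point norm. The exact evaluation protocol behind the table (alignment, masking, and whether values are percentages) is not stated here, so treat the function names `rel_depth` and `rel_point` and these definitions as assumptions.

```python
import numpy as np

def rel_depth(pred, gt, mask=None, eps=1e-8):
    """Mean of |pred - gt| / gt over valid pixels (assumed definition)."""
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    valid = gt > eps  # skip pixels without positive ground-truth depth
    if mask is not None:
        valid &= np.asarray(mask, dtype=bool)
    return float(np.mean(np.abs(pred[valid] - gt[valid]) / gt[valid]))

def rel_point(pred, gt, mask=None, eps=1e-8):
    """Mean of ||pred - gt|| / ||gt|| over valid points (assumed definition).

    pred and gt have shape (..., 3): one 3D point per pixel.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    gt_norm = np.linalg.norm(gt, axis=-1)
    err = np.linalg.norm(pred - gt, axis=-1)
    valid = gt_norm > eps  # skip points at the origin
    if mask is not None:
        valid &= np.asarray(mask, dtype=bool)
    return float(np.mean(err[valid] / gt_norm[valid]))
```

Both functions return a fraction; multiply by 100 if a percentage is wanted. For video input, the arrays can simply carry a leading frame dimension, since the mean is taken over all valid entries.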