Geo4D: Leveraging Video Generators for Geometric 4D Scene Reconstruction
About
We introduce Geo4D, a method to repurpose video diffusion models for monocular 3D reconstruction of dynamic scenes. By leveraging the strong dynamic priors captured by large-scale pre-trained video models, Geo4D can be trained using only synthetic data while generalizing well to real data in a zero-shot manner. Geo4D predicts several complementary geometric modalities, namely point, disparity, and ray maps. We propose a new multi-modal alignment algorithm to align and fuse these modalities, as well as a sliding window approach at inference time, thus enabling robust and accurate 4D reconstruction of long videos. Extensive experiments across multiple benchmarks show that Geo4D significantly surpasses state-of-the-art video depth estimation methods.
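The sliding-window inference mentioned above can be illustrated with a minimal sketch. It assumes a long video is split into fixed-length, overlapping frame windows that are processed one at a time. The function name `sliding_windows` and the `window` and `stride` parameters are hypothetical, since the abstract does not give the actual window length, overlap, or how overlapping predictions are fused.

```python
def sliding_windows(num_frames, window, stride):
    """Return (start, end) frame ranges covering [0, num_frames) with overlap.

    Illustrative only: `window` and `stride` are assumed parameters,
    not values taken from Geo4D.
    """
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if stride > window:
        raise ValueError("stride must not exceed window, or frames would be skipped")
    # A short video fits in a single (possibly shorter) window.
    if num_frames <= window:
        return [(0, num_frames)]
    starts = list(range(0, num_frames - window + 1, stride))
    # If the regular grid stops short of the last frame, add one final
    # window aligned with the end of the video.
    if starts[-1] + window < num_frames:
        starts.append(num_frames - window)
    return [(s, s + window) for s in starts]


if __name__ == "__main__":
    for start, end in sliding_windows(num_frames=11, window=4, stride=2):
        print(f"frames {start}..{end - 1}")
```

Consecutive windows share `window - stride` frames. Those shared frames are what an alignment step would use to bring per-window predictions into a common frame of reference. That step is not shown here.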
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Camera pose estimation | TUM dynamics | RRE | 0.48 | 57 |
| Video Depth Estimation | TUM dynamics | Abs Rel | 0.175 | 27 |
| Geometric Reconstruction | DDAD (test) | Relp | 14.58 | 8 |
| Geometric Reconstruction | Monkaa (test) | Relp | 28.04 | 8 |
| Geometric Reconstruction | Sintel (test) | Relp | 34.61 | 8 |
| Camera pose estimation | Bonn 3 scenes | ATE | 36.7 | 5 |
| Video Depth Estimation | BEDLAM | Abs Rel | 0.058 | 5 |
| Video Depth Estimation | Bonn 3 scenes | Abs Rel | 0.087 | 5 |
| Camera pose estimation | GTA-IM | ATE | 0.107 | 5 |
| Video Depth Estimation | GTA-IM | Abs Rel | 0.218 | 5 |