MotionCrafter: Dense Geometry and Motion Reconstruction with a 4D VAE
About
We introduce MotionCrafter, a video diffusion-based framework that jointly reconstructs 4D geometry and estimates dense motion from a monocular video. The core of our method is a novel joint representation of dense 3D point maps and 3D scene flows in a shared coordinate system, together with a novel 4D VAE that learns this representation effectively. Prior work forces the 3D values and their latents to align strictly with RGB VAE latents, despite their fundamentally different distributions; we show that such alignment is unnecessary and leads to suboptimal performance. Instead, we introduce a new data normalization and VAE training strategy that better transfers diffusion priors and greatly improves reconstruction quality. Extensive experiments across multiple datasets demonstrate that MotionCrafter achieves state-of-the-art performance in both geometry reconstruction and dense scene flow estimation, delivering 38.64% and 25.0% improvements in geometry and motion reconstruction, respectively, all without any post-optimization. Project page: https://ruijiezhu94.github.io/MotionCrafter_Page
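To make the joint representation concrete, the following is a minimal sketch, not the authors' implementation, of stacking per-frame point maps and scene flows that live in one shared coordinate system into a single 6-channel array. The function name `build_joint_representation`, the array layout `(T, H, W, 3)`, and the mean-centering plus mean-distance scaling are all assumptions made for illustration; the abstract does not specify the actual normalization.

```python
import numpy as np


def build_joint_representation(point_maps, scene_flows, valid=None):
    """Stack point maps and scene flows into one (T, H, W, 6) array.

    point_maps  : (T, H, W, 3) 3D points, all frames in one shared frame.
    scene_flows : (T, H, W, 3) per-pixel 3D displacement to the next timestep.
    valid       : optional (T, H, W) boolean mask of pixels with usable data.

    The normalization here is a hypothetical placeholder, not the paper's
    strategy: points are centered and divided by their mean distance to the
    center, and flows are divided by the same scale.
    """
    point_maps = np.asarray(point_maps, dtype=np.float64)
    scene_flows = np.asarray(scene_flows, dtype=np.float64)
    if point_maps.shape != scene_flows.shape or point_maps.shape[-1] != 3:
        raise ValueError("expected two arrays of identical shape (T, H, W, 3)")
    if valid is None:
        valid = np.ones(point_maps.shape[:-1], dtype=bool)

    pts = point_maps[valid]                       # (N, 3) valid points only
    if pts.size == 0:
        raise ValueError("no valid pixels")
    center = pts.mean(axis=0)
    scale = max(np.linalg.norm(pts - center, axis=-1).mean(), 1e-8)

    norm_points = (point_maps - center) / scale   # positions: shift and scale
    norm_flows = scene_flows / scale              # displacements: scale only
    joint = np.concatenate([norm_points, norm_flows], axis=-1)
    return joint, center, scale
```

Because flows are displacements, they are scaled but not shifted, so a normalized point plus its normalized flow still equals the normalized position of that point at the next timestep.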
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Dense Tracking | Kubric | EPE | 4.6 | 11 |
| Geometric Reconstruction | Monkaa (test) | Rel^p | 25.88 | 8 |
| Geometric Reconstruction | Sintel (test) | Rel^p | 32.46 | 8 |
| Geometric Reconstruction | DDAD (test) | Rel^p | 21.27 | 8 |
| World-centric geometry reconstruction | Kubric | Rel^p | 3.4 | 7 |
| World-centric geometry reconstruction | Dynamic Replica | Rel^p | 4.04 | 7 |
| World-centric geometry reconstruction | Point Odyssey | Rel^p | 9.94 | 7 |
| World-centric motion reconstruction | vKITTI 2 | EPE | 71.75 | 7 |
| World-centric geometry reconstruction | VKITTI2 | Rel^p | 14.6 | 7 |
| World-centric motion reconstruction | Spring | Endpoint Error | 5.61 | 7 |
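
For readers unfamiliar with the EPE and Rel^p results listed above, here is a minimal sketch of how such metrics are commonly computed. It assumes EPE is the mean Euclidean distance between predicted and ground-truth 3D flow vectors, and that Rel^p is the mean point error divided by the ground-truth point norm. The function names, array shapes, and masking behavior are illustrative assumptions; the exact evaluation protocol, units, and scaling behind the table values are not stated here.

```python
import numpy as np


def _masked_mean(values, valid):
    """Mean of `values` over pixels where `valid` is True (all if None)."""
    if valid is not None:
        values = values[np.asarray(valid, dtype=bool)]
    if values.size == 0:
        raise ValueError("no valid pixels to average over")
    return float(values.mean())


def endpoint_error(pred_flow, gt_flow, valid=None):
    """Mean Euclidean distance between predicted and true flow vectors.

    pred_flow, gt_flow : (..., 3) arrays of 3D scene flow.
    """
    diff = np.asarray(pred_flow, dtype=np.float64) - np.asarray(gt_flow, dtype=np.float64)
    return _masked_mean(np.linalg.norm(diff, axis=-1), valid)


def relative_point_error(pred_points, gt_points, valid=None, eps=1e-8):
    """Mean of ||pred - gt|| / ||gt|| over valid points (assumed Rel^p form).

    pred_points, gt_points : (..., 3) arrays of 3D points.
    eps guards against division by zero for points at the origin.
    """
    pred = np.asarray(pred_points, dtype=np.float64)
    gt = np.asarray(gt_points, dtype=np.float64)
    err = np.linalg.norm(pred - gt, axis=-1)
    ref = np.maximum(np.linalg.norm(gt, axis=-1), eps)
    return _masked_mean(err / ref, valid)
```

Lower is better for both quantities. Any alignment between prediction and ground truth (for example a global scale or shift fit) would have to happen before calling these functions and is not covered by this sketch.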