Gaussian Variation Field Diffusion for High-fidelity Video-to-4D Synthesis
About
In this paper, we present a novel framework for video-to-4D generation that creates high-quality dynamic 3D content from single video inputs. Direct 4D diffusion modeling is extremely challenging due to costly data construction and the high-dimensional nature of jointly representing 3D shape, appearance, and motion. We address these challenges by introducing a Direct 4DMesh-to-GS Variation Field VAE that directly encodes canonical Gaussian Splats (GS) and their temporal variations from 3D animation data without per-instance fitting, and compresses high-dimensional animations into a compact latent space. Building upon this efficient representation, we train a Gaussian Variation Field diffusion model with a temporal-aware Diffusion Transformer conditioned on input videos and canonical GS. Trained on carefully curated animatable 3D objects from the Objaverse dataset, our model demonstrates superior generation quality compared to existing methods. It also exhibits remarkable generalization to in-the-wild video inputs despite being trained exclusively on synthetic data, paving the way for generating high-quality animated 3D content. Project page: https://gvfdiffusion.github.io/.
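The core idea above, representing an animation as a canonical set of Gaussian Splats plus a compactly encoded per-frame variation field, can be sketched with a toy example. This is not the authors' implementation: the linear encoder/decoder, dimensions, and variable names below are all hypothetical, standing in for the paper's VAE to illustrate how a high-dimensional variation field compresses into a short latent trajectory.

```python
import numpy as np

# Toy linear "variation-field autoencoder" sketch (hypothetical, for
# illustration only; the paper uses a learned Direct 4DMesh-to-GS VAE).
rng = np.random.default_rng(0)

N, C = 256, 14   # number of Gaussians, attributes per Gaussian (position, scale, ...)
T = 8            # animation frames
D = 64           # compact latent size per frame

canonical_gs = rng.normal(size=(N, C))           # canonical Gaussian Splats
variations = rng.normal(size=(T, N, C)) * 0.05   # per-frame variation field

# Random linear projection as a stand-in encoder; pseudo-inverse as decoder.
W_enc = rng.normal(size=(N * C, D)) / np.sqrt(N * C)
W_dec = np.linalg.pinv(W_enc)

latents = variations.reshape(T, -1) @ W_enc      # (T, D): compact latent trajectory
recon = (latents @ W_dec).reshape(T, N, C)       # decoded (lossy) variation field

# Each animated frame is the canonical splats plus its decoded variation.
frames = canonical_gs[None] + recon
print(frames.shape, f"compression {(N * C) / D:.0f}x")
```

In the paper, a diffusion model is then trained in this compact latent space (conditioned on the input video and canonical GS) instead of over the raw, high-dimensional per-frame Gaussian attributes.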
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| 4D Mesh Reconstruction | Objaverse (test) | CD 0.1157 | 13 |
| 4D Synthesis | Monocular Video | FPS 0.8 | 8 |
| 4D Mesh Generation | Truebones Zoo (test) | CD 0.1406 | 6 |
| 3D Reconstruction | Objaverse Diffusion4D curated 1.0 (test) | P2S 0.0345 | 5 |
| 4D Motion Modeling | Motion-80 Short Sequence | CD 0.197 | 5 |
| Video-to-4D Object Generation | video-to-4D object generation (test) | CLIP Score 0.931 | 5 |
| 4D Object Reconstruction | DeformingThings (test) | CD 0.2806 | 5 |
| Novel View Synthesis | Objaverse | PSNR 17.31 | 5 |
| 3D Motion Generation | 20 static meshes (test) | OC 0.167 | 4 |
| Text-to-Motion Generation | BIMO | Text-to-Motion Agreement (TA) 2.343 | 4 |