Controlling Space and Time with Diffusion Models

About

We present 4DiM, a cascaded diffusion model for 4D novel view synthesis (NVS), supporting generation with arbitrary camera trajectories and timestamps, in natural scenes, conditioned on one or more images. With a novel architecture and sampling procedure, we enable training on a mixture of 3D (with camera pose), 4D (pose+time) and video (time but no pose) data, which greatly improves generalization to unseen images and camera pose trajectories over prior works that focus on limited domains (e.g., object-centric). 4DiM is the first-ever NVS method with intuitive metric-scale camera pose control, enabled by our novel calibration pipeline for structure-from-motion-posed data. Experiments demonstrate that 4DiM outperforms prior 3D NVS models in both image fidelity and pose alignment, while also enabling the generation of scene dynamics. 4DiM provides a general framework for a variety of tasks including single-image-to-3D, two-image-to-video (interpolation and extrapolation), and pose-conditioned video-to-video translation, which we illustrate qualitatively on a variety of scenes. For an overview, see https://4d-diffusion.github.io

Daniel Watson, Saurabh Saxena, Lala Li, Andrea Tagliasacchi, David J. Fleet • 2024

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Generation | VBench | -- | 102 |
| Camera-controlled Video Generation | Proposed Dataset (1800 generated videos) | RelRot 3.66 | 4 |
| 4D Scene Synthesis | NSFF (Fixed Viewpoint, Varying Time) | PSNR 19.77 | 2 |
| 4D Scene Synthesis | NSFF (Varying Viewpoint, Fixed Time) | PSNR 18.81 | 2 |
| 4D Scene Synthesis | NSFF (Varying Viewpoint, Varying Time) | PSNR 17.28 | 2 |
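
The PSNR results listed for the NSFF settings are peak signal-to-noise ratios in decibels, where higher means the generated frame is closer to the reference. As a reminder of the standard definition only (this is not the paper's evaluation code; the function name `psnr`, the flat pixel lists, and the assumed [0, 1] value range are illustrative assumptions), a minimal sketch:

```python
import math


def psnr(pred, target, max_val=1.0):
    """Standard PSNR in dB between two equal-length sequences of pixel values.

    Assumes pixel values lie in [0, max_val]; this range is an assumption,
    not something stated on this page.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all pixels.
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical inputs
    return 10.0 * math.log10(max_val ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives MSE = 0.01, i.e. about 20 dB.
example_db = psnr([0.0, 0.0], [0.1, 0.1])
```

Under this definition, each 10 dB drop corresponds to a tenfold increase in mean squared error, so the reported values in the 17 to 20 dB range differ by less than a factor of two in MSE.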