
VD3D: Taming Large Video Diffusion Transformers for 3D Camera Control

About

Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications related to content creation, visual effects, and 3D vision. Recently, new methods have demonstrated the ability to generate videos with controllable camera poses; these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. The approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.
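
The camera conditioning described above hinges on Plücker coordinates: each pixel of each frame is represented by its camera ray, i.e., a direction d and a moment m = o × d, yielding a 6-channel embedding per pixel that can be stacked over time into a spatiotemporal conditioning signal. The sketch below shows how such an embedding can be computed for a single camera, assuming PyTorch; the function name and argument conventions are illustrative, not taken from the paper's code.

```python
# Minimal sketch (not the authors' implementation) of per-pixel Plücker-coordinate
# camera embeddings built from camera intrinsics and extrinsics.
import torch

def plucker_embedding(K, c2w, height, width):
    """Return a (6, H, W) Plücker embedding for one camera.

    K   : (3, 3) intrinsics matrix
    c2w : (4, 4) camera-to-world extrinsics matrix
    """
    # Pixel grid in homogeneous image coordinates (pixel centers)
    v, u = torch.meshgrid(
        torch.arange(height, dtype=torch.float32),
        torch.arange(width, dtype=torch.float32),
        indexing="ij",
    )
    ones = torch.ones_like(u)
    pix = torch.stack([u + 0.5, v + 0.5, ones], dim=0).reshape(3, -1)  # (3, H*W)

    # Back-project pixels to ray directions in camera space, rotate to world space
    dirs_cam = torch.linalg.inv(K) @ pix                      # (3, H*W)
    dirs_world = c2w[:3, :3] @ dirs_cam                       # (3, H*W)
    dirs_world = dirs_world / dirs_world.norm(dim=0, keepdim=True)

    # Ray origin is the camera center; moment m = o x d
    origin = c2w[:3, 3:4].expand_as(dirs_world)               # (3, H*W)
    moment = torch.cross(origin, dirs_world, dim=0)           # (3, H*W)

    # Plücker coordinates (d, m) give a 6-channel embedding per pixel
    return torch.cat([dirs_world, moment], dim=0).reshape(6, height, width)
```

Stacking these per-frame embeddings over a clip gives a (frames, 6, H, W) tensor, which is the kind of spatiotemporal camera signal a ControlNet-like conditioning branch can ingest alongside the video tokens.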

Sherwin Bahmani, Ivan Skorokhodov, Aliaksandr Siarohin, Willi Menapace, Guocheng Qian, Michael Vasilkovsky, Hsin-Ying Lee, Chaoyang Wang, Jiaxu Zou, Andrea Tagliasacchi, David B. Lindell, Sergey Tulyakov • 2024

Related benchmarks

Task                                  Dataset                                     Result           Rank
3D Camera-Controlled Video Synthesis  RealEstate10K (unseen camera trajectories)  TransErr 0.409   9
3D Camera-Controlled Video Synthesis  MSR-VTT (unseen camera trajectories)        TransErr 0.486   9
Camera-controlled Video Generation    DL3DV-140                                   FID 22.7         6
Camera-controlled Video Generation    RealEstate10K                               FID 21.4         6
Camera-controlled Video Generation    Tanks&Temples                               FID 24.33        6
