# Lumiere: A Space-Time Diffusion Model for Video Generation

## About
We introduce Lumiere -- a text-to-video diffusion model designed for synthesizing videos that portray realistic, diverse and coherent motion -- a pivotal challenge in video synthesis. To this end, we introduce a Space-Time U-Net architecture that generates the entire temporal duration of the video at once, through a single pass in the model. This is in contrast to existing video models which synthesize distant keyframes followed by temporal super-resolution -- an approach that inherently makes global temporal consistency difficult to achieve. By deploying both spatial and (importantly) temporal down- and up-sampling and leveraging a pre-trained text-to-image diffusion model, our model learns to directly generate a full-frame-rate, low-resolution video by processing it in multiple space-time scales. We demonstrate state-of-the-art text-to-video generation results, and show that our design easily facilitates a wide range of content creation tasks and video editing applications, including image-to-video, video inpainting, and stylized generation.
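To make the core architectural idea concrete, below is a minimal, illustrative sketch (not the authors' released code) of a Space-Time U-Net: 3D convolutions down-sample and up-sample the clip in both space and time, so the network operates on a compact space-time representation and emits every frame in a single pass. All module names, hyper-parameters, and the toy encoder-decoder layout are assumptions for illustration; the real model also inflates a pre-trained text-to-image backbone and adds attention layers, which are omitted here.

```python
# Illustrative sketch of a Space-Time U-Net (STUNet) idea; all names and
# hyper-parameters are assumptions, not the paper's implementation.
import torch
import torch.nn as nn


class SpaceTimeDown(nn.Module):
    """Halve spatial resolution AND temporal duration (assumed design)."""

    def __init__(self, channels: int):
        super().__init__()
        # A strided 3D conv compresses the clip jointly over (T, H, W),
        # unlike spatial-only video U-Nets that keep all frames at full rate.
        self.conv = nn.Conv3d(channels, channels * 2,
                              kernel_size=3, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, frames, height, width)
        return self.conv(x)


class SpaceTimeUp(nn.Module):
    """Double spatial resolution AND temporal duration (assumed design)."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.ConvTranspose3d(channels, channels // 2,
                                       kernel_size=4, stride=2, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.conv(x)


class TinySTUNet(nn.Module):
    """Toy space-time encoder-decoder; skip connections, text conditioning,
    and the pre-trained image backbone are omitted for brevity."""

    def __init__(self, in_channels: int = 3, base: int = 32):
        super().__init__()
        self.stem = nn.Conv3d(in_channels, base, 3, padding=1)
        self.down1, self.down2 = SpaceTimeDown(base), SpaceTimeDown(base * 2)
        self.up1, self.up2 = SpaceTimeUp(base * 4), SpaceTimeUp(base * 2)
        self.head = nn.Conv3d(base, in_channels, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.stem(x)
        h = self.down2(self.down1(h))  # coarsest space-time scale
        h = self.up2(self.up1(h))
        return self.head(h)            # all frames produced in one pass


if __name__ == "__main__":
    clip = torch.randn(1, 3, 16, 64, 64)  # (batch, channels, T, H, W)
    out = TinySTUNet()(clip)
    print(out.shape)                      # torch.Size([1, 3, 16, 64, 64])
```

The point of the sketch is the contrast the abstract draws: because down-sampling acts on the temporal axis as well, the bottleneck holds the whole clip at once, rather than a set of distant keyframes that a separate temporal super-resolution stage must later interpolate.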
## Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Generation | Physics-IQ | Phys. IQ Score | 23 | 45 |
| Text-to-Video Generation | UCF-101 (test) | FVD | 332.5 | 25 |
| Background layer reconstruction | OmnimatteRF benchmark, synthetic movie scenes (test) | PSNR | 29.04 | 13 |
| Background layer reconstruction | OmnimatteRF benchmark, synthetic Kubric scenes (test) | PSNR | 31.46 | 6 |
| Physical Plausibility Evaluation | Physics-IQ (modified) | Solid Mechanics Score | 27.3 | 6 |