SEINE: Short-to-Long Video Diffusion Model for Generative Transition and Prediction

About

Recently, video generation has achieved substantial progress with realistic results. Nevertheless, existing AI-generated videos are usually very short clips ("shot-level") depicting a single scene. To deliver a coherent long video ("story-level"), it is desirable to have creative transition and prediction effects across different clips. This paper presents a short-to-long video diffusion model, SEINE, that focuses on generative transition and prediction. The goal is to generate high-quality long videos with smooth and creative transitions between scenes and varying lengths of shot-level videos. Specifically, we propose a random-mask video diffusion model to automatically generate transitions based on textual descriptions. By providing the images of different scenes as inputs, combined with text-based control, our model generates transition videos that ensure coherence and visual quality. Furthermore, the model can be readily extended to various tasks such as image-to-video animation and autoregressive video prediction. To conduct a comprehensive evaluation of this new generative task, we propose three assessment criteria for smooth and creative transition: temporal consistency, semantic similarity, and video-text semantic alignment. Extensive experiments validate the effectiveness of our approach over existing methods for generative transition and prediction, enabling the creation of story-level long videos. Project page: https://vchitect.github.io/SEINE-project/
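The abstract does not spell out how the random mask is built, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a per-frame binary mask where 1 marks a frame supplied as a condition and 0 marks a frame the model must generate. All function names (`frame_mask`, `transition_mask`, `animation_mask`, `prediction_mask`, `random_mask`, `apply_mask`) are hypothetical, and frames are stood in for by plain floats.

```python
import random
from typing import List, Sequence


def frame_mask(num_frames: int, visible: Sequence[int]) -> List[int]:
    """Build a 0/1 mask: 1 = frame is given as a condition, 0 = frame to generate."""
    mask = [0] * num_frames
    for i in visible:
        if not 0 <= i < num_frames:
            raise ValueError(f"frame index {i} out of range for {num_frames} frames")
        mask[i] = 1
    return mask


def transition_mask(num_frames: int) -> List[int]:
    """Assumed transition setup: first and last frames (two scenes) are given."""
    return frame_mask(num_frames, [0, num_frames - 1])


def animation_mask(num_frames: int) -> List[int]:
    """Assumed image-to-video setup: only the first frame is given."""
    return frame_mask(num_frames, [0])


def prediction_mask(num_frames: int, context: int) -> List[int]:
    """Assumed prediction setup: the first `context` frames are given."""
    return frame_mask(num_frames, range(context))


def random_mask(num_frames: int, keep_prob: float, rng: random.Random) -> List[int]:
    """Assumed training-time mask: each frame is kept independently with keep_prob."""
    return [1 if rng.random() < keep_prob else 0 for _ in range(num_frames)]


def apply_mask(frames: Sequence[float], mask: Sequence[int], fill: float = 0.0) -> List[float]:
    """Replace masked-out frames with a fill value, keeping conditioning frames."""
    return [f if m else fill for f, m in zip(frames, mask)]
```

Under these assumptions, the three tasks named in the abstract differ only in which frames are visible, which is one plausible reading of why a single random-mask model can cover transition, animation, and prediction.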

Xinyuan Chen, Yaohui Wang, Lingjun Zhang, Shaobin Zhuang, Xin Ma, Jiashuo Yu, Yali Wang, Dahua Lin, Yu Qiao, Ziwei Liu • 2023

Related benchmarks

Task | Dataset | Result | Rank
Video Generation | VBench | Quality Score 70.97 | 126
Video Generation | Physics-IQ | Phys. IQ Score 29.13 | 63
Image-to-Video Generation | VBench | Motion Smoothness 0.9665 | 28
Image-to-Video Generation | VBench I2V | Background Consistency 97.26 | 24
Video Generation | Kinetics-600 | FVD 332.8 | 22
Image-to-Video Generation | VBench I2V 1.0 (test) | Subject Consistency 96.57 | 13
Human Image Animation | Curated (test) | CPBD 0.6218 | 9
Image-to-Video Generation | VBench-I2V general | Subject Consistency (I2V) 96.6 | 8
First-Person Video Generation | Epic100 | Reasonability 7.2 | 6
Video Generation | EgoFHO | FVD 316.6 | 6
Showing 10 of 15 rows
