MoStGAN-V: Video Generation with Temporal Motion Styles

About

Video generation remains a challenging task due to spatiotemporal complexity and the requirement of synthesizing diverse motions with temporal consistency. Previous works attempt to generate videos of arbitrary length either in an autoregressive manner or by treating time as a continuous signal. However, they struggle to synthesize detailed and diverse motions with temporal coherence and tend to generate repetitive scenes after a few time steps. In this work, we argue that a single time-agnostic latent vector of a style-based generator is insufficient to model diverse and temporally consistent motions. Hence, we introduce additional time-dependent motion styles to model diverse motion patterns. In addition, a Motion Style Attention modulation mechanism, dubbed MoStAtt, is proposed to augment frames with vivid dynamics for each specific scale (i.e., layer). It assigns an attention score to each motion style with respect to the deconvolution filter weights in the target synthesis layer and softly attends to different motion styles for weight modulation. Experimental results show our model achieves state-of-the-art performance on four unconditional $256^2$ video synthesis benchmarks trained with only 3 frames per clip and produces better qualitative results with respect to dynamic motions. Code and videos have been made available at https://github.com/xiaoqian-shen/MoStGAN-V.
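
To make the attention-based weight modulation described above more concrete, here is a minimal NumPy sketch. It is not the authors' implementation: the function name, the way a query is derived from the filter weights (a spatial mean), the dot-product scoring with its scaling, and the StyleGAN2-style demodulation step are all assumptions chosen for illustration. It only shows the general idea of scoring several motion styles against a layer's filter weights, softly mixing them, and using the mix to modulate those weights.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def motion_style_attention_modulation(weight, motion_styles, eps=1e-8):
    """Hypothetical sketch of attention-based weight modulation.

    weight:        (C_out, C_in, kH, kW) filter weights of one synthesis layer
    motion_styles: (K, C_in) one style vector per motion style
    Returns the modulated weights and the (C_out, K) attention map.
    """
    c_in = weight.shape[1]

    # Assumption: summarize each output filter by averaging its spatial taps,
    # giving one query vector per output filter, shape (C_out, C_in).
    queries = weight.mean(axis=(2, 3))

    # Scaled dot-product score of every filter against every motion style.
    scores = queries @ motion_styles.T / np.sqrt(c_in)  # (C_out, K)
    attn = softmax(scores, axis=1)                      # each row sums to 1

    # Soft mixture of motion styles for each output filter, shape (C_out, C_in).
    mixed_style = attn @ motion_styles

    # Modulate the weights channel-wise with the mixed style.
    modulated = weight * mixed_style[:, :, None, None]

    # Assumption: demodulate per output filter, as in StyleGAN2.
    norm = np.sqrt((modulated ** 2).sum(axis=(1, 2, 3), keepdims=True) + eps)
    return modulated / norm, attn


rng = np.random.default_rng(0)
w = rng.normal(size=(4, 3, 3, 3))   # 4 output filters, 3 input channels
styles = rng.normal(size=(5, 3))    # 5 motion styles
w_mod, attn = motion_style_attention_modulation(w, styles)
print(w_mod.shape, attn.shape)
```

In this sketch every output filter receives its own soft combination of the motion styles, which is one plausible reading of "softly attends to different motion styles"; the released code linked above is the authoritative reference for the actual mechanism.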

Xiaoqian Shen, Xiang Li, Mohamed Elhoseiny • 2023

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Generation | UCF101 | -- | 54 |
| Unconditional video generation | UCF-101 256x256 | FVD (256x256, 2048) 1.38e+3 | 12 |
| Video Generation | Colonoscopic 25 | FVD 468.5 | 6 |
| Video Generation | Kvasir-Capsule 26 | FVD 82.8 | 6 |
| Unconditional video generation | FaceForensics 256^2 | FVD (16 frames) 39.7 | 5 |
| Unconditional video generation | SkyTimelapse 256^2 | FVD1665.3 | 5 |
| Unconditional video generation | RainbowJelly 256^2 | FVD1670.1 | 5 |
| Unconditional video generation | CelebV-HQ 256^2 | FVD (16 frames) 56.1 | 5 |
| Text-conditional Video Generation | MUGEN | FVD 129.8 | 4 |
