Improved Conditional VRNNs for Video Prediction
About
Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder (VAE). While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher-capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions that better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics on three different datasets.
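The core idea above — a hierarchy of latent variables, where each level's distribution is conditioned on the level above it, yielding a more flexible prior/posterior than a single Gaussian — can be sketched as follows. This is a minimal illustrative example, not the authors' code: the network mappings, shapes, and the `hierarchical_prior_sample` helper are all assumptions for the sketch.

```python
# Minimal sketch (not the authors' implementation): sampling from a
# two-level hierarchy of Gaussian latents, conditioned top-down.
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, exp(logvar)) via the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def hierarchical_prior_sample(h, levels=2, dim=8):
    """Draw latents top-down: each level's Gaussian is conditioned on the
    recurrent state h and on the latent sampled at the level above, so the
    marginal over the bottom latent is a flexible non-Gaussian mixture."""
    zs = []
    context = h
    for _ in range(levels):
        # Stand-ins for learned networks mapping context -> (mu, logvar).
        mu = 0.1 * context[:dim]
        logvar = -np.ones(dim)  # fixed small variance, for the sketch only
        z = reparameterize(mu, logvar)
        zs.append(z)
        # Condition the next level on the latent just sampled.
        context = np.concatenate([h[:dim], z])
    return zs

h = rng.standard_normal(16)  # hypothetical RNN hidden state at time t
latents = hierarchical_prior_sample(h)
print(len(latents), latents[0].shape)  # 2 (8,)
```

In a full conditional VRNN, `mu` and `logvar` at each level would come from learned networks, and a matching posterior hierarchy would be trained against this prior with a KL term per level.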
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Prediction | BAIR Push (test) | FVD 121.3 | 30 |
| Video Synthesis | iPER (test) | FVD 181.5 | 11 |
| Video Prediction | Cityscapes 128x128 resolution (test) | FVD 567.5 | 9 |
| Video Prediction | Human3.6M (test) | FVD 465.6 | 9 |
| Video Generation | KTH | FVD 67.26 | 8 |
| Video Generation | BAIR | FVD 134.8 | 7 |
| Video Generation | Human3.6M | FVD 523.5 | 5 |
| Video Prediction | Tai-Chi-HD (test) | FVD 202.3 | 4 |
| Video Prediction | Poking-Plants (PP) (test) | FVD 184.5 | 4 |
| Video Prediction | Stochastic Moving MNIST (test) | FVD 57.17 | 3 |