Improved Conditional VRNNs for Video Prediction
About
Predicting future frames for a video sequence is a challenging generative modeling task. Promising approaches include probabilistic latent variable models such as the Variational Auto-Encoder (VAE). While VAEs can handle uncertainty and model multiple possible future outcomes, they have a tendency to produce blurry predictions. In this work we argue that this is a sign of underfitting. To address this issue, we propose to increase the expressiveness of the latent distributions and to use higher-capacity likelihood models. Our approach relies on a hierarchy of latent variables, which defines a family of flexible prior and posterior distributions that better model the probability of future sequences. We validate our proposal through a series of ablation experiments and compare our approach to current state-of-the-art latent variable models. Our method performs favorably under several metrics on three different datasets.
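The core idea above — a hierarchy of latent variables, where each level's distribution is conditioned on the level above it, yielding a more flexible prior/posterior than a single Gaussian — can be sketched as follows. This is a minimal illustrative example, not the authors' code: the network mappings, shapes, and the `hierarchical_prior_sample` helper are all assumptions for the sketch.

```python
# Minimal sketch (not the authors' implementation): sampling from a
# two-level hierarchy of Gaussian latents, conditioned top-down.
import numpy as np

rng = np.random.default_rng(0)

def reparameterize(mu, logvar):
    """Sample z ~ N(mu, exp(logvar)) via the reparameterization trick."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def hierarchical_prior_sample(h, levels=2, dim=8):
    """Draw latents top-down: each level's Gaussian is conditioned on the
    recurrent state h and on the latent sampled at the level above, so the
    marginal over the bottom latent is a flexible non-Gaussian mixture."""
    zs = []
    context = h
    for _ in range(levels):
        # Stand-ins for learned networks mapping context -> (mu, logvar).
        mu = 0.1 * context[:dim]
        logvar = -np.ones(dim)  # fixed small variance, for the sketch only
        z = reparameterize(mu, logvar)
        zs.append(z)
        # Condition the next level on the latent just sampled.
        context = np.concatenate([h[:dim], z])
    return zs

h = rng.standard_normal(16)  # hypothetical RNN hidden state at time t
latents = hierarchical_prior_sample(h)
print(len(latents), latents[0].shape)  # 2 (8,)
```

In a full conditional VRNN, `mu` and `logvar` at each level would come from learned networks, and a matching posterior hierarchy would be trained against this prior with a KL term per level.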
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Prediction | BAIR Push (test) | FVD 121.3 | 30 |
| Video Synthesis | iPER (test) | FVD 181.5 | 11 |
| Video Prediction | Cityscapes 128x128 resolution (test) | FVD 567.5 | 9 |
| Video Prediction | Human3.6M (test) | FVD 465.6 | 9 |
| Video Generation | KTH | FVD 67.26 | 8 |
| Video Generation | BAIR | FVD 134.8 | 7 |
| Video Generation | Human3.6M | FVD 523.5 | 5 |
| Video Prediction | Tai-Chi-HD (test) | FVD 202.3 | 4 |
| Video Prediction | Poking-Plants (PP) (test) | FVD 184.5 | 4 |
| Video Prediction | Stochastic Moving MNIST (test) | FVD 57.17 | 3 |