Transformation-based Adversarial Video Prediction on Large-Scale Data
About
Recent breakthroughs in adversarial generative modeling have led to models capable of producing video samples of high quality, even on large and complex datasets of real-world video. In this work, we focus on the task of video prediction, where given a sequence of frames extracted from a video, the goal is to generate a plausible future sequence. We first improve the state of the art by performing a systematic empirical study of discriminator decompositions and proposing an architecture that yields faster convergence and higher performance than previous approaches. We then analyze recurrent units in the generator, and propose a novel recurrent unit which transforms its past hidden state according to predicted motion-like features, and refines it to handle dis-occlusions, scene changes and other complex behavior. We show that this recurrent unit consistently outperforms previous designs. Our final model leads to a leap in state-of-the-art performance, obtaining a test set Fréchet Video Distance (FVD) of 25.7, down from 69.2, on the large-scale Kinetics-600 dataset.
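To make the headline metric concrete, the following is a minimal sketch of the Fréchet distance between two Gaussians, which is the quantity that Fréchet Video Distance reports once real and generated videos have been embedded by a pretrained video network. It is not the authors' evaluation code: the function names `gaussian_stats` and `frechet_distance` are hypothetical, and the random arrays in the usage lines merely stand in for network features. Lower values mean the two feature distributions are closer.

```python
import numpy as np


def gaussian_stats(features):
    """Return the mean vector and covariance matrix of an (n_samples, dim) array."""
    features = np.asarray(features, dtype=np.float64)
    mu = features.mean(axis=0)
    sigma = np.atleast_2d(np.cov(features, rowvar=False))
    return mu, sigma


def frechet_distance(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2).

    d^2 = ||mu1 - mu2||^2 + Tr(sigma1 + sigma2 - 2 * (sigma1 sigma2)^(1/2))
    """
    mu1, mu2 = np.asarray(mu1, dtype=np.float64), np.asarray(mu2, dtype=np.float64)
    sigma1 = np.asarray(sigma1, dtype=np.float64)
    sigma2 = np.asarray(sigma2, dtype=np.float64)
    diff = mu1 - mu2
    # The trace of the matrix square root equals the sum of the square roots
    # of the eigenvalues of sigma1 @ sigma2, which are real and non-negative
    # for positive semi-definite inputs. Clipping guards against tiny
    # negative values caused by floating-point error.
    eigvals = np.linalg.eigvals(sigma1 @ sigma2)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    return float(diff @ diff + np.trace(sigma1) + np.trace(sigma2) - 2.0 * trace_sqrt)


if __name__ == "__main__":
    # Assumption: random vectors stand in for features of real and generated videos.
    rng = np.random.default_rng(0)
    real_features = rng.normal(size=(512, 8))
    generated_features = rng.normal(loc=0.5, size=(512, 8))
    score = frechet_distance(*gaussian_stats(real_features),
                             *gaussian_stats(generated_features))
    print(f"Frechet distance: {score:.3f}")
```

Computing the trace through eigenvalues avoids an explicit matrix square root, which keeps the sketch dependent on NumPy alone.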
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Video Prediction | BAIR (test) | FVD 103.3 | 59 |
| Video Prediction | Kinetics-600 (test) | FVD 25.7 | 46 |
| Video Prediction | BAIR Robot Pushing | FVD 103 | 38 |
| Video Prediction | Bair | FVD 103.3 | 34 |
| Video Frame Prediction | Kinetics-600 | gFVD 25.7 | 28 |
| Video Prediction | Kinetics-600 | FVD 25.7 | 18 |
| Future video prediction | BAIR 64x64 and 256x256 (test) | FVD 103 | 16 |
| Frame prediction | Bair | FVD 103 | 15 |
| Video modeling | BAIR Robot Pushing (test) | FVD 103.3 | 14 |
| Video Prediction | BAIR 64x64 | FVD 103 | 14 |