High Fidelity Video Prediction with Large Stochastic Recurrent Neural Networks
About
Predicting future video frames is extremely challenging, as there are many factors of variation that make up the dynamics of how frames change through time. Previously proposed solutions require complex inductive biases inside network architectures with highly specialized computation, including segmentation masks, optical flow, and foreground/background separation. In this work, we question whether such handcrafted architectures are necessary and instead propose a different approach: finding minimal inductive bias for video prediction while maximizing network capacity. We investigate this question by performing the first large-scale empirical study and demonstrate state-of-the-art performance by learning large models on three different datasets: one for modeling object interactions, one for modeling human motion, and one for modeling car driving.
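The approach above scales up a generic recurrent video model rather than adding specialized modules. As a rough illustration only, the following is a toy deterministic ConvLSTM predictor in NumPy: it conditions on a few context frames, then rolls out autoregressively by feeding each predicted frame back in. The paper's actual models are far larger and include stochastic latent variables; every name and size here (`ConvLSTMCell`, the 3x3 kernels, the 16x16 frames) is an assumption for this sketch, not the authors' implementation.

```python
import numpy as np

def conv2d(x, w):
    """'Same'-padded 2-D cross-correlation. x: (C_in, H, W), w: (C_out, C_in, 3, 3)."""
    c_in, h, wd = x.shape
    c_out = w.shape[0]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))
    out = np.zeros((c_out, h, wd))
    for i in range(3):
        for j in range(3):
            # Each kernel tap adds a shifted copy of the input, mixed across channels.
            out += np.einsum("oc,chw->ohw", w[:, :, i, j], xp[:, i:i + h, j:j + wd])
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ConvLSTMCell:
    """Minimal ConvLSTM: all four gates come from one 3x3 conv over [input, hidden]."""
    def __init__(self, c_in, c_hid, rng):
        self.w = rng.normal(0.0, 0.1, (4 * c_hid, c_in + c_hid, 3, 3))

    def step(self, x, h, c):
        gates = conv2d(np.concatenate([x, h]), self.w)
        i, f, o, g = np.split(gates, 4)          # input, forget, output, candidate
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        return h, c

rng = np.random.default_rng(0)
cell = ConvLSTMCell(c_in=1, c_hid=8, rng=rng)
w_out = rng.normal(0.0, 0.1, (1, 8, 3, 3))       # readout conv: hidden -> next frame

h = c = np.zeros((8, 16, 16))
context = [rng.random((1, 16, 16)) for _ in range(2)]
for frame in context:                            # condition on observed frames
    h, c = cell.step(frame, h, c)

preds = []
x = context[-1]
for _ in range(3):                               # autoregressive rollout
    h, c = cell.step(x, h, c)
    x = sigmoid(conv2d(h, w_out))                # predicted next frame in [0, 1]
    preds.append(x)
print(len(preds), preds[0].shape)
```

With random, untrained weights the rollout is of course meaningless; the point is only the shape of the computation, where "maximizing capacity" amounts to widening `c_hid` and stacking more such cells.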
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Prediction | Human3.6M (4 frames → 4 frames) | PSNR | 32.11 | 20 |
| Blue button | VP2 benchmark | Mean Success Rate | 97.33 | 7 |
| Open slide | VP2 benchmark | Mean Success Rate | 57.33 | 7 |
| Red button | VP2 benchmark | Mean Success Rate | 76 | 7 |
| Robosuite push | VP2 benchmark | Mean Success Rate | 79.8 | 7 |
| Open drawer | VP2 benchmark | Mean Success Rate | 16.67 | 7 |
| Video Prediction | RoboNet | FVD | 123.2 | 7 |
| Green button | VP2 benchmark | Mean Success Rate | 81.33 | 7 |
| Upright block | VP2 benchmark | Mean Success Rate | 48.67 | 7 |
| Video Prediction | RoboNet (test) | FVD | 123.2 | 7 |
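The PSNR figure in the table measures per-frame reconstruction quality (higher is better), defined as 10·log10(MAX² / MSE). A minimal sketch of computing it between a predicted and a ground-truth frame, assuming 8-bit images so MAX = 255 (FVD, by contrast, is a distribution-level metric that needs a pretrained video network and is not reproduced here):

```python
import numpy as np

def psnr(pred, target, max_val=255.0):
    """Peak signal-to-noise ratio in dB between two frames (higher is better)."""
    mse = np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")   # identical frames
    return 10.0 * np.log10(max_val ** 2 / mse)

# Identical frames give infinite PSNR; a single off-by-one pixel lowers it.
a = np.full((64, 64), 128, dtype=np.uint8)
b = a.copy()
b[0, 0] = 129
print(psnr(a, a), psnr(a, b))
```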