MaskViT: Masked Visual Pre-Training for Video Prediction
About
The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
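The iterative-refinement inference described above can be sketched in a few lines: start from a fully masked token grid, and at each step commit the model's most confident predictions while the fraction of still-masked tokens decays according to a mask scheduling function (a cosine schedule here). This is a minimal illustrative sketch, not the authors' implementation; `predict_fn` is a hypothetical stand-in for the transformer's per-token prediction, and all names are assumptions.

```python
import math
import random

def cosine_schedule(t: float) -> float:
    # Fraction of tokens still masked at progress t in [0, 1];
    # reaches 0 at t = 1 so the final step fills every token.
    return math.cos(t * math.pi / 2)

def iterative_decode(num_tokens: int, steps: int, predict_fn, seed: int = 0):
    """Sketch of MaskGIT-style iterative refinement decoding.

    predict_fn(tokens, rng) must return (values, confidences) for all
    positions; masked positions are represented by None. This mirrors
    the idea of incrementally decreasing the masking ratio, not any
    specific API from the paper.
    """
    rng = random.Random(seed)
    tokens = [None] * num_tokens  # start fully masked
    for step in range(1, steps + 1):
        values, conf = predict_fn(tokens, rng)
        # Target number of tokens left masked after this step.
        keep_masked = int(cosine_schedule(step / steps) * num_tokens)
        # Among currently masked positions, commit the most confident.
        masked = [i for i in range(num_tokens) if tokens[i] is None]
        masked.sort(key=lambda i: conf[i], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = values[i]
    return tokens
```

Because many tokens are committed per step instead of one, the number of forward passes drops from `num_tokens` (autoregressive decoding) to `steps`, which is the source of the inference speedup the abstract refers to.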
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Prediction | BAIR (test) | FVD | 93.6 | 59 |
| Video Prediction | BAIR Robot Pushing | FVD | 70.5 | 38 |
| Video Prediction | BAIR | FVD | 93.7 | 34 |
| Frame Prediction | BAIR | FVD | 94 | 15 |
| Robosuite push | VP2 benchmark | Mean Success Rate | 82.6 | 7 |
| Upright block | VP2 benchmark | Mean Success Rate | 62 | 7 |
| Blue button | VP2 benchmark | Mean Success Rate | 94.67 | 7 |
| Flat block | VP2 benchmark | Mean Success Rate | 4 | 7 |
| Video Prediction | RoboNet | FVD | 133.5 | 7 |
| Open drawer | VP2 benchmark | Mean Success Rate | 4 | 7 |