MaskViT: Masked Visual Pre-Training for Video Prediction

About

The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
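The iterative refinement decoding described above can be sketched in a few lines. This is an illustrative assumption, not the authors' implementation: the cosine mask schedule, the confidence-based re-masking rule, and the `predict_fn` interface are all stand-ins for the real tokenizer/transformer pipeline.

```python
import math
import random

def mask_schedule(step, total_steps):
    # Fraction of tokens still masked after this step.
    # A cosine schedule is one common choice of mask scheduling
    # function; assumed here for illustration.
    return math.cos(math.pi / 2 * (step + 1) / total_steps)

def iterative_decode(num_tokens, total_steps, predict_fn):
    """Sketch of iterative refinement: start fully masked, predict all
    masked tokens each step, keep the most confident predictions, and
    re-mask the rest following the schedule."""
    MASK = None
    tokens = [MASK] * num_tokens
    for step in range(total_steps):
        # Predict (value, confidence) for every currently masked position.
        preds = {i: predict_fn(i) for i, t in enumerate(tokens) if t is MASK}
        for i, (value, _conf) in preds.items():
            tokens[i] = value
        # Decreasing number of tokens to re-mask for the next step.
        n_mask = int(mask_schedule(step, total_steps) * num_tokens)
        if n_mask == 0:
            continue
        # Re-mask the least confident of the newly filled positions.
        ranked = sorted(preds, key=lambda i: preds[i][1])
        for i in ranked[:n_mask]:
            tokens[i] = MASK
    return tokens
```

Because the schedule reaches zero on the final step, every position ends up filled after `total_steps` passes instead of the `num_tokens` sequential passes an autoregressive decoder would need, which is the source of the inference speedup.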

Agrim Gupta, Stephen Tian, Yunzhi Zhang, Jiajun Wu, Roberto Martín-Martín, Li Fei-Fei • 2022

Related benchmarks

Task              Dataset             Metric             Result   Rank
Video Prediction  BAIR (test)         FVD                93.6     59
Video Prediction  BAIR Robot Pushing  FVD                70.5     38
Video Prediction  Bair                FVD                93.7     34
Frame prediction  Bair                FVD                94       15
Robosuite push    VP2 benchmark       Mean Success Rate  82.6     7
Upright block     VP2                 Mean Success Rate  62       7
Blue button       VP2 benchmark      Mean Success Rate  94.67    7
Flat block        VP2 benchmark       Mean Success Rate  4        7
Video Prediction  RoboNet             FVD                133.5    7
open drawer       VP2 benchmark       Mean Success Rate  4        7

(Showing 10 of 14 rows)