MaskViT: Masked Visual Pre-Training for Video Prediction
About
The ability to predict future visual observations conditioned on past observations and motor commands can enable embodied agents to plan solutions to a variety of tasks in complex environments. This work shows that we can create good video prediction models by pre-training transformers via masked visual modeling. Our approach, named MaskViT, is based on two simple design decisions. First, for memory and training efficiency, we use two types of window attention: spatial and spatiotemporal. Second, during training, we mask a variable percentage of tokens instead of a fixed mask ratio. For inference, MaskViT generates all tokens via iterative refinement where we incrementally decrease the masking ratio following a mask scheduling function. On several datasets we demonstrate that MaskViT outperforms prior works in video prediction, is parameter efficient, and can generate high-resolution videos (256x256). Further, we demonstrate the benefits of inference speedup (up to 512x) due to iterative decoding by using MaskViT for planning on a real robot. Our work suggests that we can endow embodied agents with powerful predictive models by leveraging the general framework of masked visual modeling with minimal domain knowledge.
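The iterative-refinement inference described above can be sketched in a few lines: start from a fully masked token grid, and at each step commit the model's most confident predictions while the fraction of still-masked tokens decays according to a mask scheduling function (a cosine schedule here). This is a minimal illustrative sketch, not the authors' implementation; `predict_fn` is a hypothetical stand-in for the transformer's per-token prediction, and all names are assumptions.

```python
import math
import random

def cosine_schedule(t: float) -> float:
    # Fraction of tokens still masked at progress t in [0, 1];
    # reaches 0 at t = 1 so the final step fills every token.
    return math.cos(t * math.pi / 2)

def iterative_decode(num_tokens: int, steps: int, predict_fn, seed: int = 0):
    """Sketch of MaskGIT-style iterative refinement decoding.

    predict_fn(tokens, rng) must return (values, confidences) for all
    positions; masked positions are represented by None. This mirrors
    the idea of incrementally decreasing the masking ratio, not any
    specific API from the paper.
    """
    rng = random.Random(seed)
    tokens = [None] * num_tokens  # start fully masked
    for step in range(1, steps + 1):
        values, conf = predict_fn(tokens, rng)
        # Target number of tokens left masked after this step.
        keep_masked = int(cosine_schedule(step / steps) * num_tokens)
        # Among currently masked positions, commit the most confident.
        masked = [i for i in range(num_tokens) if tokens[i] is None]
        masked.sort(key=lambda i: conf[i], reverse=True)
        for i in masked[: len(masked) - keep_masked]:
            tokens[i] = values[i]
    return tokens
```

Because many tokens are committed per step instead of one, the number of forward passes drops from `num_tokens` (autoregressive decoding) to `steps`, which is the source of the inference speedup the abstract refers to.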
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Prediction | BAIR (test) | FVD | 93.6 | 59 |
| Video Prediction | BAIR Robot Pushing | FVD | 70.5 | 38 |
| Video Prediction | BAIR | FVD | 93.7 | 34 |
| Frame Prediction | BAIR | FVD | 94 | 15 |
| Robosuite push | VP2 benchmark | Mean Success Rate | 82.6 | 7 |
| Upright block | VP2 benchmark | Mean Success Rate | 62 | 7 |
| Blue button | VP2 benchmark | Mean Success Rate | 94.67 | 7 |
| Flat block | VP2 benchmark | Mean Success Rate | 4 | 7 |
| Video Prediction | RoboNet | FVD | 133.5 | 7 |
| Open drawer | VP2 benchmark | Mean Success Rate | 4 | 7 |