iVideoGPT: Interactive VideoGPTs are Scalable World Models
About
World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, giving agents an interactive experience via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that can be adapted to serve as an interactive world model for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
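The idea of folding observations, actions, and rewards into one token sequence for next-token prediction can be illustrated with a small sketch. This is not the iVideoGPT implementation: the function names, the per-step ordering (observation tokens, then action, then reward), and the use of plain integer ids for actions and rewards are all assumptions made for illustration only.

```python
from typing import List, Sequence, Tuple


def flatten_trajectory(
    obs_tokens: Sequence[Sequence[int]],
    action_tokens: Sequence[int],
    reward_tokens: Sequence[int],
) -> List[int]:
    """Interleave per-step tokens into one flat sequence.

    Assumed (hypothetical) layout per step t:
        [obs_t tokens..., action_t, reward_t]
    """
    if not (len(obs_tokens) == len(action_tokens) == len(reward_tokens)):
        raise ValueError("all modalities must cover the same number of steps")

    sequence: List[int] = []
    for obs, action, reward in zip(obs_tokens, action_tokens, reward_tokens):
        sequence.extend(obs)      # discretized visual observation
        sequence.append(action)   # action taken after seeing the observation
        sequence.append(reward)   # reward received for that action
    return sequence


def next_token_pairs(sequence: Sequence[int]) -> Tuple[List[int], List[int]]:
    """Build (inputs, targets) for next-token prediction by shifting one position."""
    return list(sequence[:-1]), list(sequence[1:])


# Two steps, two visual tokens per frame (toy values).
tokens = flatten_trajectory(
    obs_tokens=[[1, 2], [3, 4]],
    action_tokens=[10, 11],
    reward_tokens=[20, 21],
)
inputs, targets = next_token_pairs(tokens)
```

With a sequence laid out this way, a single autoregressive model can in principle be asked to continue from any prefix ending in an action, which is what makes step-by-step interaction possible.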
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Prediction | BAIR Robot Pushing | FVD | 60.8 | 38 |
| Block-stacking | Video-CraftBench | Success Rate (Human) | 15.3 | 14 |
| Sequential Paper Folding | Video-CraftBench | Step 1 Success Rate | 23.1 | 14 |
| Video Generation | Video-CraftBench | SSIM | 58.8 | 14 |
| Class-Conditional Video Generation | Benchmark 17x256x256 resolution (test) | gFVD | 254.8 | 9 |
| Red button | VP2 benchmark | Mean Success Rate | 92.22 | 7 |
| Open drawer | VP2 benchmark | Mean Success Rate | 37.5 | 7 |
| Video Prediction | RoboNet | FVD | 63.2 | 7 |
| Blue button | VP2 benchmark | Mean Success Rate | 95.56 | 7 |
| Green button | VP2 benchmark | Mean Success Rate | 82.5 | 7 |