
iVideoGPT: Interactive VideoGPTs are Scalable World Models

About

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals (visual observations, actions, and rewards) into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long • 2024

Related benchmarks

Task | Dataset | Result | Rank
Video Prediction | BAIR Robot Pushing | FVD 60.8 | 38
Block-stacking | Video-CraftBench | Success Rate (Human) 15.3 | 14
Sequential Paper Folding | Video-CraftBench | Step 1 Success Rate 23.1 | 14
Video Generation | Video-CraftBench | SSIM 58.8 | 14
Class-Conditional Video Generation | Benchmark 17x256x256 resolution (test) | gFVD 254.8 | 9
Red button | VP2 benchmark | Mean Success Rate 92.22 | 7
Open drawer | VP2 benchmark | Mean Success Rate 37.5 | 7
Video Prediction | RoboNet | FVD 63.2 | 7
Blue button | VP2 benchmark | Mean Success Rate 95.56 | 7
Green button | VP2 benchmark | Mean Success Rate 82.5 | 7
Showing 10 of 15 rows
