iVideoGPT: Interactive VideoGPTs are Scalable World Models

About

World models empower model-based agents to interactively explore, reason, and plan within imagined environments for real-world decision-making. However, the high demand for interactivity poses challenges in harnessing recent advancements in video generative models for developing world models at scale. This work introduces Interactive VideoGPT (iVideoGPT), a scalable autoregressive transformer framework that integrates multimodal signals--visual observations, actions, and rewards--into a sequence of tokens, facilitating an interactive experience of agents via next-token prediction. iVideoGPT features a novel compressive tokenization technique that efficiently discretizes high-dimensional visual observations. Leveraging its scalable architecture, we are able to pre-train iVideoGPT on millions of human and robotic manipulation trajectories, establishing a versatile foundation that is adaptable to serve as interactive world models for a wide range of downstream tasks. These include action-conditioned video prediction, visual planning, and model-based reinforcement learning, where iVideoGPT achieves competitive performance compared with state-of-the-art methods. Our work advances the development of interactive general world models, bridging the gap between generative video models and practical model-based reinforcement learning applications. Code and pre-trained models are available at https://thuml.github.io/iVideoGPT.
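The idea of folding observations, actions, and rewards into one token sequence for next-token prediction can be sketched as follows. This is a minimal illustration and not the authors' implementation: the vocabulary sizes, the offsets, the `BOS` separator, and the function `flatten_trajectory` are all hypothetical, and it assumes actions and rewards have already been discretized into integer bins, which the paper may handle differently.

```python
from typing import List, Sequence

# Hypothetical vocabulary layout (assumptions for illustration only).
NUM_VISUAL_TOKENS = 1024  # assumed size of a visual codebook
NUM_ACTION_BINS = 16      # assumed number of discretized action bins
NUM_REWARD_BINS = 8       # assumed number of discretized reward bins

ACTION_OFFSET = NUM_VISUAL_TOKENS
REWARD_OFFSET = ACTION_OFFSET + NUM_ACTION_BINS
BOS = REWARD_OFFSET + NUM_REWARD_BINS  # hypothetical per-step separator token


def flatten_trajectory(
    frames: Sequence[Sequence[int]],
    actions: Sequence[int],
    rewards: Sequence[int],
) -> List[int]:
    """Interleave per-step visual, action, and reward tokens into one sequence.

    Each step contributes: [BOS, visual tokens..., action token, reward token].
    Action and reward bins are shifted into their own id ranges so that they
    never collide with visual token ids.
    """
    if not (len(frames) == len(actions) == len(rewards)):
        raise ValueError("frames, actions, and rewards must have equal length")

    sequence: List[int] = []
    for visual_tokens, action_bin, reward_bin in zip(frames, actions, rewards):
        if not 0 <= action_bin < NUM_ACTION_BINS:
            raise ValueError("action bin out of range")
        if not 0 <= reward_bin < NUM_REWARD_BINS:
            raise ValueError("reward bin out of range")
        sequence.append(BOS)
        sequence.extend(visual_tokens)
        sequence.append(ACTION_OFFSET + action_bin)
        sequence.append(REWARD_OFFSET + reward_bin)
    return sequence
```

An autoregressive transformer trained on such sequences would predict each token from the ones before it, so an agent could supply an action token and read back the predicted next-frame and reward tokens.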

Jialong Wu, Shaofeng Yin, Ningya Feng, Xu He, Dong Li, Jianye Hao, Mingsheng Long • 2024

Related benchmarks

Task	Dataset	Result
Video Prediction	BAIR Robot Pushing	FVD 60.8	38
Robot World Modeling	RobotWorldBench	Instruction Score 2.6	18
Block-stacking	Video-CraftBench	Success Rate (Human) 15.3	14
Sequential Paper Folding	Video-CraftBench	Step 1 Success Rate 23.1	14
Video Generation	Video-CraftBench	SSIM 58.8	14
Video Prediction	RoboNet	FVD 63.2	13
Class-Conditional Video Generation	Benchmark 17x256x256 resolution (test)	gFVD 254.8	9
Red button	VP2 benchmark	Mean Success Rate 92.22	7
open drawer	VP2 benchmark	Mean Success Rate 37.5	7
Blue button	VP2 benchmark	Mean Success Rate 95.56	7

Showing 10 of 25 rows
