Yume: An Interactive World Generation Model

About

Yume aims to use images, text, or videos to create an interactive, realistic, and dynamic world that can be explored and controlled using peripheral devices or neural signals. In this report, we present a preview version of Yume, which creates a dynamic world from an input image and allows exploration of that world using keyboard actions. To achieve high-fidelity, interactive video world generation, we introduce a framework with four main components: camera motion quantization, video generation architecture, an advanced sampler, and model acceleration. First, we quantize camera motions for stable training and user-friendly interaction via keyboard inputs. Then, we introduce the Masked Video Diffusion Transformer (MVDT) with a memory module for infinite video generation in an autoregressive manner. After that, a training-free Anti-Artifact Mechanism (AAM) and Time Travel Sampling based on Stochastic Differential Equations (TTS-SDE) are introduced to the sampler for better visual quality and more precise control. Moreover, we investigate model acceleration through synergistic optimization of adversarial distillation and caching mechanisms. We train Yume on the high-quality world exploration dataset Sekai, and it achieves remarkable results in diverse scenes and applications. All data, the codebase, and model weights are available at https://github.com/stdstu12/YUME. Yume will be updated monthly to work toward its original goal. Project page: https://stdstu12.github.io/YUME-Project/.
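As a rough illustration of what quantizing camera motion into keyboard actions could look like, the sketch below maps one step of relative camera motion to a pair of discrete key labels. It is not the authors' implementation: the function name, the key vocabulary, the axis conventions, and both thresholds are assumptions made only for this example.

```python
import numpy as np


def quantize_camera_motion(translation, yaw, pitch, move_eps=0.05, turn_eps=0.02):
    """Map one step of relative camera motion to discrete key labels.

    Hypothetical conventions (not taken from the paper):
      translation: (x, y, z) in the camera frame, x = right, z = forward.
      yaw, pitch:  rotation change in radians; positive yaw turns right,
                   positive pitch looks up.
      move_eps, turn_eps: illustrative dead-zone thresholds.
    """
    x, _, z = np.asarray(translation, dtype=float)

    # Translation: pick the dominant horizontal axis, or "none" inside the dead zone.
    if max(abs(x), abs(z)) < move_eps:
        move = "none"
    elif abs(z) >= abs(x):
        move = "W" if z > 0 else "S"
    else:
        move = "D" if x > 0 else "A"

    # Rotation: pick the dominant rotation axis, or "none" inside the dead zone.
    if max(abs(yaw), abs(pitch)) < turn_eps:
        turn = "none"
    elif abs(yaw) >= abs(pitch):
        turn = "right" if yaw > 0 else "left"
    else:
        turn = "up" if pitch > 0 else "down"

    return move, turn
```

Under these assumptions, a continuous trajectory becomes a short sequence of labels such as `("W", "none")` or `("A", "right")`, which is the kind of discrete signal that can both condition training and be produced directly from key presses at inference time.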

Xiaofeng Mao, Shaoheng Lin, Zhen Li, Chuanhao Li, Wenshuo Peng, Tong He, Jiangmiao Pang, Mingmin Chi, Yu Qiao, Kaipeng Zhang • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Visual generation | 2D trajectory dataset | LPIPS | 0.571 | 16 |
| Interactive Video World Model Evaluation | RealWM120K VBench (val) | Latency (s) | 1.92e+3 | 9 |
| Action-Conditioned Video Generation | Astra-Bench | Rotation Error | 2.2 | 5 |
| Long-horizon 3D Consistency | 200-frame closed-loop camera trajectories | Sharpness | 95 | 5 |
| Action-driven video generation | Astra-Bench (held-out samples) | Instruction Following | 65.2 | 4 |
| Image-to-Video Generation | Yume-Bench | Image Fidelity (IF) | 65.7 | 4 |
| Interactive Video Generation | CityWalker (unseen scenes) | Instruction Following | 61.9 | 4 |
| Joint camera and object motion control | VerseControl4D 1.0 (test) | Overall Score | 85.47 | 4 |
| Interactive Video Generation | WorldCam-50h | RPE_trans | 0.111 | 4 |
| Interactive Video Generation | Human Evaluation Interactive Gaming | Action Controllability | 2.47 | 4 |
Showing 10 of 12 rows
