
Yume-1.5: A Text-Controlled Interactive World Generation Model

About

Recent approaches have demonstrated the promise of diffusion models for generating interactive, explorable worlds. However, most of these methods face critical challenges: excessively large parameter counts, reliance on lengthy inference procedures, and rapidly growing historical context, which severely limit real-time performance; they also lack text-controlled generation capabilities. To address these challenges, we propose Yume-1.5, a novel framework designed to generate realistic, interactive, and continuous worlds from a single image or text prompt, with support for keyboard-based exploration of the generated worlds. The framework comprises three core components: (1) a long-video generation framework integrating unified context compression with linear attention; (2) a real-time streaming acceleration strategy powered by bidirectional attention distillation and an enhanced text embedding scheme; and (3) a text-controlled method for generating world events. We provide the codebase in the supplementary material.
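The abstract credits linear attention with keeping the growing historical context tractable. The paper's exact formulation is not given here, so as a hedged illustration, the following sketch shows the standard linear-attention trick (in the style of Katharopoulos et al.): applying a positive feature map to queries and keys lets the key-value statistics be aggregated once, reducing cost from O(N²) to O(N) in sequence length. The feature map and shapes are assumptions, not the model's actual implementation.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(N) attention sketch: phi(Q) @ (phi(K)^T V), row-normalized.

    Q, K: (N, d) queries/keys; V: (N, d_v) values.
    Assumed feature map phi(x) = elu(x) + 1, which keeps scores positive.
    """
    phi = lambda x: np.where(x > 0, x + 1.0, np.exp(x))
    Qp, Kp = phi(Q), phi(K)
    # Summarize all keys/values once: cost O(N * d * d_v), independent of N^2.
    KV = Kp.T @ V                        # (d, d_v)
    Z = Qp @ Kp.sum(axis=0) + eps        # (N,) per-query normalizer
    return (Qp @ KV) / Z[:, None]        # (N, d_v)
```

Because the `KV` summary has a fixed size regardless of how many past frames are attended to, a streaming generator can update it incrementally instead of re-reading the full history each step.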

Xiaofeng Mao, Zhen Li, Chuanhao Li, Xiaojie Xu, Kaining Ying, Tong He, Jiangmiao Pang, Yu Qiao, Kaipeng Zhang• 2025

Related benchmarks

Task | Dataset | Result | Rank
Video Generation | VBench | Quality Score: 69.69 | 126
Video Generation | VBench (test) | -- | 48
Interactive Video World Model Evaluation | RealWM120K VBench (val) | Latency (s): 19 | 9
World Simulation | Busan-City-Bench | FID: 54.82 | 8
World Simulation | Ann-Arbor-City-Bench | FID: 85.62 | 8
Long Video Generation | DL3DV-Evaluation (test) | SSIM: 0.342 | 8
Long Video Generation | Tanks&Temples (test) | SSIM: 34.8 | 8
3D Scene Generation | Tanks&Temples (test) | LPIPS (Perceptual): 0.575 | 7
3D Scene Generation | DL3DV (test) | LPIPS (Perceptual): 0.598 | 7
Interactive World Modeling | General Game World Modeling | Resolution: 480 | 6

(Showing 10 of 12 rows)

Other info

GitHub
