DreamVLA: A Vision-Language-Action Model Dreamed with Comprehensive World Knowledge

About

Recent advances in vision-language-action (VLA) models have shown promise in integrating image generation with action prediction to improve generalization and reasoning in robot manipulation. However, existing methods are limited to challenging image-based forecasting, which suffers from redundant information and lacks comprehensive and critical world knowledge, including dynamic, spatial and semantic information. To address these limitations, we propose DreamVLA, a novel VLA framework that integrates comprehensive world knowledge forecasting to enable inverse dynamics modeling, thereby establishing a perception-prediction-action loop for manipulation tasks. Specifically, DreamVLA introduces dynamic-region-guided world knowledge prediction, integrated with spatial and semantic cues, which provide compact yet comprehensive representations for action planning. This design aligns with how humans interact with the world by first forming abstract multimodal reasoning chains before acting. To mitigate interference among the dynamic, spatial and semantic information during training, we adopt a block-wise structured attention mechanism that masks their mutual attention, preventing information leakage and keeping each representation clean and disentangled. Moreover, to model the conditional distribution over future actions, we employ a diffusion-based transformer that disentangles action representations from shared latent features. Extensive experiments in both real-world and simulation environments demonstrate that DreamVLA achieves a 76.7% success rate on real robot tasks and an average length of 4.44 on the CALVIN ABC-D benchmarks.
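The block-wise structured attention described above can be illustrated with a small mask-construction sketch. This is not the authors' implementation: the group names, token counts, and the choice that context tokens attend only to other context tokens are assumptions made for illustration. The only property taken from the abstract is that the dynamic, spatial and semantic tokens do not attend to one another.

```python
import numpy as np

def blockwise_mask(group_sizes):
    """Build a boolean mask where allowed[i, j] means query i may attend to key j.

    group_sizes maps a (hypothetical) group name to its token count and must
    contain a "context" entry. Every token may attend to context tokens and to
    tokens of its own group; attention across different groups is blocked.
    """
    names = list(group_sizes)
    labels = np.concatenate(
        [np.full(n, k) for k, n in enumerate(group_sizes.values())]
    )
    same_group = labels[:, None] == labels[None, :]
    key_is_context = (labels == names.index("context"))[None, :]
    return same_group | key_is_context

def masked_softmax(scores, allowed):
    """Row-wise softmax over attention scores, giving zero weight to blocked pairs."""
    scores = np.where(allowed, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

# Illustrative token layout (sizes are made up, not taken from the paper).
sizes = {"context": 3, "dynamic": 2, "spatial": 2, "semantic": 2}
allowed = blockwise_mask(sizes)

rng = np.random.default_rng(0)
scores = rng.normal(size=allowed.shape)
weights = masked_softmax(scores, allowed)
print(allowed.astype(int))
```

In the printed mask, the rows for the dynamic, spatial and semantic tokens have ones only in the context columns and in their own block, so no attention weight flows between the three kinds of world knowledge.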

Wenyao Zhang, Hongsi Liu, Zekun Qi, Yunnan Wang, Xinqiang Yu, Jiazhao Zhang, Runpei Dong, Jiawei He, Fan Lu, He Wang, Zhizheng Zhang, Li Yi, Wenjun Zeng, Xin Jin • 2025

Related benchmarks

Task	Dataset	Metric	Result	
Robot Manipulation	LIBERO	Object Achievement	94	1025
Robotic Manipulation	LIBERO	Spatial Success Rate	97.5	570
Robotic Manipulation	LIBERO-Plus	Language Understanding Score	67	414
Robot Manipulation	LIBERO (test)	Average Success Rate	92.6	237
Robot Manipulation	LIBERO	Spatial Success Rate	97.5	223
Robotic Manipulation	LIBERO	Long-horizon Success Rate	89.5	165
Long-horizon robot manipulation	Calvin ABCD→D	Task 1 Completion Rate	98.2	140
Robotic Manipulation	Calvin ABCD→D	Avg Length	4.44	139
Robotic Manipulation	LIBERO v1 (test)	Average Success Rate	92.6	118
Robotic Manipulation	LIBERO	Long Success Rate	89.5	108

Showing 10 of 46 rows
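The two CALVIN metrics in the table, task completion rate and average length, are related. The sketch below assumes the usual CALVIN protocol, in which each evaluation rollout chains five instructions and stops at the first failure. The rollout outcomes are made-up numbers for illustration, not results from the paper, and `calvin_metrics` is a hypothetical helper name.

```python
def calvin_metrics(completed_per_chain, chain_length=5):
    """Compute per-stage completion rates and average length.

    completed_per_chain holds, for each rollout, how many consecutive tasks
    were completed before the first failure (0 to chain_length).
    """
    n = len(completed_per_chain)
    # Fraction of rollouts that completed at least k tasks in a row.
    success_at = [
        sum(c >= k for c in completed_per_chain) / n
        for k in range(1, chain_length + 1)
    ]
    # Mean number of consecutive tasks completed per rollout.
    avg_len = sum(completed_per_chain) / n
    return success_at, avg_len

# Hypothetical outcomes for five rollouts (illustrative only).
rollouts = [5, 5, 3, 0, 4]
success_at, avg_len = calvin_metrics(rollouts)
print(success_at, avg_len)
```

Under this definition the average length equals the sum of the five per-stage completion rates, so an average length close to 5 requires high completion rates at every stage, not just the first.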
