Dual-Stream Diffusion for World-Model Augmented Vision-Language-Action Model

About

Augmenting vision-language-action models (VLAs) with world models is promising for robotic policy learning but faces challenges in jointly predicting states and actions due to the modality gap. To address this, we propose DUal-STream diffusion (DUST), a world-model augmented VLA framework featuring a multimodal diffusion transformer that maintains separate modality streams while enabling cross-modal knowledge sharing. In addition, DUST utilizes independent noise perturbations and a decoupled flow matching loss to learn cross-modal causal relationships. We further introduce an asynchronous sampling method for action and vision tokens that enhances performance through inference-time scaling. Experimental results on simulated benchmarks like RoboCasa and GR-1 show that DUST achieves up to 6% gains over state-of-the-art VLA and world-modeling baselines, with inference-time scaling providing an additional 2-5% improvement. In real-world tasks using the Franka Research 3, DUST outperforms baselines by 10% in success rate. Finally, we demonstrate that DUST enables effective transfer learning through both pretraining on action-free videos and joint-training with heterogeneous robot and human datasets.
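The independent noise perturbations and decoupled flow matching loss mentioned above can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a linear noise-to-data interpolation path, uniformly sampled timesteps, and a hypothetical `model` callable that returns one velocity prediction per modality. The names `decoupled_fm_loss`, `interpolate`, `zero_model`, and `vision_weight` are invented for illustration.

```python
import random

def interpolate(data, noise, t):
    """Point on the straight path from noise (t=0) to data (t=1)."""
    return [(1.0 - t) * n + t * d for d, n in zip(data, noise)]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)

def decoupled_fm_loss(model, action, vision, rng, vision_weight=1.0):
    # Independent perturbations: each modality draws its own timestep and
    # its own Gaussian noise, so the two noise levels are uncorrelated.
    t_a, t_v = rng.random(), rng.random()
    eps_a = [rng.gauss(0.0, 1.0) for _ in action]
    eps_v = [rng.gauss(0.0, 1.0) for _ in vision]

    noisy_a = interpolate(action, eps_a, t_a)
    noisy_v = interpolate(vision, eps_v, t_v)

    # Hypothetical model interface (assumption): sees both noisy streams and
    # both timesteps, returns a velocity estimate for each stream.
    pred_a, pred_v = model(noisy_a, noisy_v, t_a, t_v)

    # Velocity target of the linear path is (data - noise) for each modality.
    target_a = [d - n for d, n in zip(action, eps_a)]
    target_v = [d - n for d, n in zip(vision, eps_v)]

    # Decoupled loss: one term per modality, summed with an assumed weight.
    loss_a = mse(pred_a, target_a)
    loss_v = mse(pred_v, target_v)
    return loss_a + vision_weight * loss_v, loss_a, loss_v

def zero_model(noisy_a, noisy_v, t_a, t_v):
    """Placeholder model that always predicts zero velocity."""
    return [0.0] * len(noisy_a), [0.0] * len(noisy_v)

rng = random.Random(0)
action = [0.1, -0.2, 0.3]          # toy action vector
vision = [0.5, 0.0, -0.5, 1.0]     # toy future-observation embedding
total, loss_a, loss_v = decoupled_fm_loss(zero_model, action, vision, rng)
```

Because the two timesteps are drawn separately, a trained model of this shape would see combinations such as a nearly clean vision stream paired with a heavily noised action stream, which is one plausible reading of how independent perturbations expose cross-modal dependencies.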

John Won, Kyungmin Lee, Huiwon Jang, Dongyoung Kim, Jinwoo Shin • 2025

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	LIBERO	Spatial Success Rate: 96.2	570
Long-horizon robotic manipulation	Calvin ABC->D	Average Trajectory Length: 3.91	48
Robotic Manipulation	RoboCasa Kitchen	Success Rate: 58.5	22
Kitchen manipulation	RoboCasa 24 kitchen manipulation tasks	Average Success Rate: 58.5	12
Humanoid tabletop manipulation	GR-1 300 Demos	Success Rate (PnP): 35.8	7
Robot Manipulation	RoboCasa 100 demos	PnP Success Rate: 29.5	7
Robot Manipulation	RoboCasa 300 demos	PnP Success Rate: 42.3	7
Robot Manipulation	Franka Research 3 Real-world	Average Success Rate: 59.9	7
Humanoid tabletop manipulation	GR-1 1,000 Demos	PnP Success Rate: 42.2	5
Robot Manipulation	RoboCasa 1,000 demos	PnP Success Rate: 48.3	5
