Pixel Motion Diffusion is What We Need for Robot Control

About

We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/

E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo • 2025

Related benchmarks

Task	Dataset	Result
Long-horizon robotic manipulation	Calvin ABC->D	Average Trajectory Length: 4	48
Robotic Manipulation	MetaWorld	Door Open Success Rate: 94.7	10
Single lift-and-place	Real-world Apple	Success: 19	4
Single lift-and-place	Real-world Avocado	Success Count: 20	4
Single lift-and-place	Real-world Banana	Success Rate: 17	4
Single lift-and-place	Real-world Grape	Success Rate: 19	4
Single lift-and-place	Real-world Kiwi	Success Rate: 17	4
Single lift-and-place	Real-world Orange	Success Rate: 18	4
