Pixel Motion Diffusion is What We Need for Robot Control

About

We present DAWN (Diffusion is All We Need for robot control), a unified diffusion-based framework for language-conditioned robotic manipulation that bridges high-level motion intent and low-level robot action via a structured pixel motion representation. In DAWN, both the high-level and low-level controllers are modeled as diffusion processes, yielding a fully trainable, end-to-end system with interpretable intermediate motion abstractions. DAWN achieves state-of-the-art results on the challenging CALVIN benchmark, demonstrating strong multi-task performance, and further validates its effectiveness on MetaWorld. Despite the substantial domain gap between simulation and reality and limited real-world data, we demonstrate reliable real-world transfer with only minimal finetuning, illustrating the practical viability of diffusion-based motion abstractions for robotic control. Our results show the effectiveness of combining diffusion modeling with motion-centric representations as a strong baseline for scalable and robust robot learning. Project page: https://eronguyen.github.io/DAWN/

E-Ro Nguyen, Yichi Zhang, Kanchana Ranasinghe, Xiang Li, Michael S. Ryoo • 2025

Related benchmarks

Task                               Dataset             Metric                  Result  Rank
Long-horizon robotic manipulation  Calvin ABC->D       Task 1 Success Rate     98      34
Robotic Manipulation               MetaWorld           Door Open Success Rate  94.7    10
Single lift-and-place              Real-world Apple    Success                 19      4
Single lift-and-place              Real-world Avocado  Success Count           20      4
Single lift-and-place              Banana Real-world   Success Rate            17      4
Single lift-and-place              Real-world Grape    Success Rate            19      4
Single lift-and-place              Real-world Kiwi     Success Rate            17      4
Single lift-and-place              Real-world Orange   Success Rate            18      4