
Vidarc: Embodied Video Diffusion Model for Closed-loop Control

About

Robotic arm manipulation in data-scarce settings is highly challenging due to complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for embodiment-specific closed-loop control, typically suffering from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error-correction capabilities across previously unseen robotic platforms.
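The control scheme the abstract describes can be sketched as a simple loop: an autoregressive video predictor (with a cache over past frames) proposes the next frame, a masked inverse dynamics model recovers an action from the action-relevant part of the frame difference, and the observation produced by executing that action is fed back into the cache. The toy linear models, names, and shapes below are illustrative assumptions only, not the paper's actual architecture or API:

```python
# Hypothetical sketch of the Vidarc-style closed loop described in the
# abstract. Everything here (ToyVideoPredictor, masked_idm, the linear
# "dynamics") is a stand-in assumption, not the paper's implementation.
import numpy as np


class ToyVideoPredictor:
    """Stands in for the autoregressive video diffusion model.

    Keeps a short cache of past frames (analogous to cached autoregressive
    generation) so each step only has to predict one new frame.
    """

    def __init__(self, frame_dim: int, context: int = 4):
        self.frame_dim = frame_dim
        self.context = context
        self.cache: list[np.ndarray] = []

    def predict_next(self) -> np.ndarray:
        # Toy dynamics: linearly extrapolate from the last two cached frames.
        if len(self.cache) < 2:
            return self.cache[-1] if self.cache else np.zeros(self.frame_dim)
        return 2 * self.cache[-1] - self.cache[-2]

    def observe(self, frame: np.ndarray) -> None:
        # Real-time feedback: the executed step's observation enters the cache.
        self.cache = (self.cache + [frame])[-self.context:]


def masked_idm(prev: np.ndarray, nxt: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Toy masked inverse dynamics: infer the action from the frame
    difference restricted to action-relevant entries (the mask)."""
    return (nxt - prev) * mask


# --- one closed-loop rollout ---
dim = 6
predictor = ToyVideoPredictor(frame_dim=dim)
# Assume the first three dimensions are action-relevant.
mask = np.array([1, 1, 1, 0, 0, 0], dtype=float)

state = np.zeros(dim)
predictor.observe(state)
state = state + 0.1          # warm-start the cache with a second frame
predictor.observe(state)

for _ in range(3):
    goal_frame = predictor.predict_next()          # predict the next frame
    action = masked_idm(state, goal_frame, mask)   # recover a masked action
    state = state + action                         # "execute" the action
    predictor.observe(state)                       # close the loop
```

Masking the inverse dynamics this way means only the predicted changes in action-relevant regions drive the controller, which is one plausible reading of how action-relevant masks ground the video predictions.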

Yao Feng, Chendong Xiang, Xinyi Mao, Hengkai Tan, Zuyue Zhang, Shuhe Huang, Kaiwen Zheng, Haitian Liu, Hang Su, Jun Zhu • 2025

Related benchmarks

Task                 | Dataset                               | Result            | Rank
Robotic Manipulation | RoboTwin                              | Success Rate 80.7 | 13
Robot Manipulation   | Real-world Robot Manipulation Average | Success Rate 56   | 8
Robot Manipulation   | Real-world scenarios (Unseen)         | Success Rate 0.56 | 3
Robot Manipulation   | Real-world scenarios (Seen)           | Success Rate 72   | 3
Robot Manipulation   | Real-world scenarios (Dynamic)        | Success Rate 40   | 3
