Vidarc: Embodied Video Diffusion Model for Closed-loop Control
About
Robotic arm manipulation in data-scarce settings is highly challenging due to complex embodiment dynamics and diverse contexts. Recent video-based approaches have shown great promise in capturing and transferring temporal and physical interactions by pre-training on Internet-scale video data. However, such methods are often not optimized for embodiment-specific closed-loop control, and they typically suffer from high latency and insufficient grounding. In this paper, we present Vidarc (Video Diffusion for Action Reasoning and Closed-loop Control), a novel autoregressive embodied video diffusion approach augmented by a masked inverse dynamics model. By grounding video predictions with action-relevant masks and incorporating real-time feedback through cached autoregressive generation, Vidarc achieves fast, accurate closed-loop control. Pre-trained on one million cross-embodiment episodes, Vidarc surpasses state-of-the-art baselines, achieving at least a 15% higher success rate in real-world deployment and a 91% reduction in latency. We also highlight its robust generalization and error-correction capabilities across previously unseen robotic platforms.
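The closed-loop design described above can be sketched in a toy form: a cached autoregressive predictor rolls out the next frame, and a masked inverse dynamics model recovers an action from the change inside the action-relevant region. All function names, shapes, and the action parameterization below are hypothetical stand-ins, since Vidarc's actual networks are not specified here; this is a minimal sketch of the control loop, not the real model.

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_next_frame(frame, cache):
    # Hypothetical stand-in for one autoregressive video-diffusion step.
    # The cache mimics reuse of already-generated context, which is how
    # cached autoregressive generation keeps per-step latency low.
    cache.append(frame)
    return np.clip(frame + 0.01 * rng.standard_normal(frame.shape), 0.0, 1.0)

def masked_inverse_dynamics(frame, next_frame, mask):
    # Toy masked inverse dynamics: infer a (scalar) action from the
    # pixel change restricted to the action-relevant masked region.
    delta = (next_frame - frame) * mask
    return float(delta.sum())

def closed_loop_step(obs, cache, mask):
    # One control step: predict the next frame, then ground an action
    # in the masked region of that prediction.
    pred = predict_next_frame(obs, cache)
    action = masked_inverse_dynamics(obs, pred, mask)
    return action, pred

# Hypothetical 8x8 grayscale observation; the mask covers a small
# "gripper" region where the action is assumed to act.
obs = np.zeros((8, 8))
mask = np.zeros((8, 8))
mask[2:4, 2:4] = 1.0

cache = []
for _ in range(3):  # three closed-loop control steps
    action, obs = closed_loop_step(obs, cache, mask)
```

The mask keeps the inverse dynamics model focused on pixels that actually reflect the robot's motion, which is one plausible reading of "grounding video predictions with action-relevant masks."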
Related benchmarks
| Task | Dataset | Success Rate (%) | Rank |
|---|---|---|---|
| Robotic Manipulation | RoboTwin | 80.7 | 13 |
| Robot Manipulation | Real-world Robot Manipulation Average | 56 | 8 |
| Robot Manipulation | Real-world scenarios (Unseen) | 56 | 3 |
| Robot Manipulation | Real-world scenarios (Seen) | 72 | 3 |
| Robot Manipulation | Real-world scenarios (Dynamic) | 40 | 3 |