
From Noise to Intent: Anchoring Generative VLA Policies with Residual Bridges

About

Bridging high-level semantic understanding with low-level physical control remains a persistent challenge in embodied intelligence, stemming from the fundamental spatiotemporal scale mismatch between cognition and action. Existing generative VLA policies typically adopt a "Generation-from-Noise" paradigm, which disregards this disparity, leading to representation inefficiency and weak condition alignment during optimization. In this work, we propose ResVLA, an architecture that shifts the paradigm to "Refinement-from-Intent." Recognizing that robotic motion naturally decomposes into global intent and local dynamics, ResVLA utilizes spectral analysis to decouple control into a deterministic low-frequency anchor and a stochastic high-frequency residual. By anchoring the generative process on the predicted intent, our model focuses strictly on refining local dynamics via a residual diffusion bridge. Extensive simulation experiments show that ResVLA achieves competitive performance, strong robustness to language and robot embodiment perturbations, and faster convergence than standard generative baselines. It also demonstrates strong performance in real-world robot experiments.
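The split into a low-frequency anchor and a high-frequency residual can be illustrated with a minimal sketch. The abstract does not say which spectral transform or cutoff ResVLA uses, so the real FFT, the `keep_bins` cutoff, the function name `split_anchor_residual`, and the toy trajectory below are all assumptions chosen for illustration, not the authors' implementation.

```python
import numpy as np


def split_anchor_residual(actions: np.ndarray, keep_bins: int):
    """Split an action chunk of shape (T, D) into low- and high-frequency parts.

    Hypothetical helper: a hard FFT low-pass filter stands in for whatever
    spectral decomposition the paper actually uses.
    """
    num_steps = actions.shape[0]
    # Real FFT along the time axis, one spectrum per action dimension.
    spectrum = np.fft.rfft(actions, axis=0)
    # Keep only the lowest `keep_bins` frequency bins (the "intent").
    low = spectrum.copy()
    low[keep_bins:] = 0.0
    anchor = np.fft.irfft(low, n=num_steps, axis=0)
    # Everything the anchor does not explain is the local-dynamics residual.
    residual = actions - anchor
    return anchor, residual


# Toy action chunk: a slow two-dimensional motion plus small jitter.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 32, endpoint=False)
slow = np.stack([np.sin(2 * np.pi * t), np.cos(2 * np.pi * t)], axis=1)
chunk = slow + 0.05 * rng.standard_normal(slow.shape)

anchor, residual = split_anchor_residual(chunk, keep_bins=3)
```

In this sketch the anchor plays the role of the deterministic intent and the residual is the part a generative model would refine. By construction, `anchor + residual` reproduces the original chunk and the residual carries no energy in the retained low-frequency bins.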

Yiming Zhong, Yaoyu He, Zemin Yang, Pengfei Tian, Yifan Huang, Qingqiu Huang, Xinge Zhu, Yuexin Ma • 2026

Related benchmarks

Task | Dataset | Result | Rank
Robot Manipulation | LIBERO | Object Achievement: 98.6 | 957
Robotic Manipulation | LIBERO-Plus | Language Understanding Score: 88.5 | 249
Robot Manipulation | SIMPLER WidowX + Bridge Setup | Spoon Success Rate: 26 | 22
Robot Manipulation | SimplerEnv Google Robot | Pick Coke Can Success Rate: 91 | 13
Multi-stage robot manipulation | ALOHA real-world platform | Pick Cup Success Rate: 60 | 2
Robot Manipulation | LIBERO-Plus In-domain Adaptation | -- | 1
