VITA: Vision-to-Action Flow Matching Policy

About

Conventional flow matching and diffusion-based policies sample via iterative denoising from standard noise distributions (e.g., Gaussian), and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce this overhead, we develop VITA (VIsion-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that flows directly from visual representations to latent actions. Since the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. Bridging vision and action is challenging because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder, trained jointly with flow matching, that maps raw actions into a structured latent space aligned with visual latents. To further prevent latent action space collapse during end-to-end training, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the flow matching ODE (ordinary differential equation) solving steps. We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5x-2x faster inference compared to conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies. Project page: https://ucd-dare.github.io/VITA/.
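To make the vision-to-action flow more concrete, here is a minimal, dependency-free sketch of flow matching between two latents of equal dimensionality. It is an illustration under stated assumptions, not the authors' implementation: it assumes a straight-line interpolation path between the vision latent and the action latent (a common flow matching choice that the abstract does not specify), and it uses a hand-written "oracle" velocity field in place of a learned network. All names (`interpolate`, `flow_matching_loss`, `euler_solve`, `oracle_velocity`) are hypothetical, and the action autoencoder and flow latent decoding are omitted.

```python
import random

def interpolate(z_vision, z_action, t):
    """Point on the assumed straight-line path: vision latent at t=0, action latent at t=1."""
    return [(1.0 - t) * v + t * a for v, a in zip(z_vision, z_action)]

def target_velocity(z_vision, z_action):
    """Velocity of the straight-line path; it is constant in t."""
    return [a - v for v, a in zip(z_vision, z_action)]

def flow_matching_loss(velocity_fn, z_vision, z_action, t):
    """Mean squared error between the predicted and the target velocity at time t."""
    x_t = interpolate(z_vision, z_action, t)
    pred = velocity_fn(x_t, t)
    target = target_velocity(z_vision, z_action)
    return sum((p - g) ** 2 for p, g in zip(pred, target)) / len(target)

def euler_solve(velocity_fn, z_vision, num_steps=8):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed-step Euler.

    The starting point is the vision latent itself, so no noise sample and
    no separate conditioning input are needed.
    """
    x = list(z_vision)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        v = velocity_fn(x, k * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x

# Toy latents with the same dimensionality (flow matching requires this).
random.seed(0)
latent_dim = 16
z_vision = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]
z_action = [random.gauss(0.0, 1.0) for _ in range(latent_dim)]

def oracle_velocity(x, t):
    """Stand-in for a trained network: returns the exact straight-line velocity."""
    return target_velocity(z_vision, z_action)

loss = flow_matching_loss(oracle_velocity, z_vision, z_action, t=0.3)
z_pred = euler_solve(oracle_velocity, z_vision)
```

With the oracle field the loss is zero and the Euler solver carries the vision latent onto the action latent. In a real policy the velocity field would be a trained network and `z_pred` would be passed through an action decoder to recover raw actions.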

Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani • 2025

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	Robomimic Can	Success Rate: 100	30
Robotic Arm Manipulation	MetaWorld Very Hard	Success Rate: 62	21
Dexterous Hand Control	Adroit	Overall Avg Success Rate: 77	19
Robotic Arm Manipulation	MetaWorld Easy	Success Rate: 85	15
close box	RLBench	Success Rate: 88	14
PickCube	RoboVerse Simulated	Success Rate: 86	13
Dexterous Hand Manipulation	DexArt	Success Rate: 55	12
Stack Cube	ManiSkill	Success Rate: 80	11
close box	RoboVerse	Success Rate: 94	11
Stack Cube	RoboVerse	Success Rate: 86	11

Showing 10 of 29 rows
