
VITA: Vision-to-Action Flow Matching Policy

About

Conventional flow matching and diffusion-based policies sample through iterative denoising from standard noise distributions (e.g., Gaussian) and require conditioning modules to repeatedly incorporate visual information during the generative process, incurring substantial time and memory overhead. To reduce this complexity, we develop VITA (VIsion-To-Action policy), a noise-free and conditioning-free flow matching policy learning framework that flows directly from visual representations to latent actions. Because the source of the flow is visually grounded, VITA eliminates the need for visual conditioning during generation. Bridging vision and action is challenging because actions are lower-dimensional, less structured, and sparser than visual representations; moreover, flow matching requires the source and target to have the same dimensionality. To overcome this, we introduce an action autoencoder that maps raw actions into a structured latent space aligned with visual latents and is trained jointly with flow matching. To further prevent latent space collapse, we propose flow latent decoding, which anchors the latent generation process by backpropagating the action reconstruction loss through the solving steps of the flow matching ODE (ordinary differential equation). We evaluate VITA on 9 simulation and 5 real-world tasks from ALOHA and Robomimic. VITA achieves 1.5x to 2x faster inference than conventional methods with conditioning modules, while outperforming or matching state-of-the-art policies. Code, datasets, and demos are available at our project page: https://ucd-dare.github.io/VITA/.
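
To make the idea of flowing from a vision latent to an action latent concrete, the following is a minimal NumPy sketch and not the authors' implementation. It assumes a straight-line (linear interpolation) path between the two latents and a fixed-step Euler solver, neither of which the abstract specifies. The names `LATENT_DIM`, `interpolate`, `flow_matching_loss`, `euler_solve`, and `oracle_velocity` are hypothetical, and the "oracle" velocity stands in for a trained network.

```python
import numpy as np

LATENT_DIM = 8  # hypothetical shared size of the vision and action latents


def interpolate(z_vision, z_action, t):
    """Point at time t on the straight path from the vision latent (t=0) to the action latent (t=1)."""
    return (1.0 - t) * z_vision + t * z_action


def flow_matching_loss(velocity_fn, z_vision, z_action, t):
    """Mean squared error between the predicted velocity and the straight-path velocity."""
    z_t = interpolate(z_vision, z_action, t)
    target = z_action - z_vision  # velocity of the assumed linear path
    return float(np.mean((velocity_fn(z_t, t) - target) ** 2))


def euler_solve(velocity_fn, z_vision, num_steps=10):
    """Integrate dz/dt = velocity_fn(z, t) from t=0 to t=1 with fixed-step Euler."""
    z = z_vision.copy()
    dt = 1.0 / num_steps
    for k in range(num_steps):
        z = z + dt * velocity_fn(z, k * dt)
    return z


# Toy latents; in a real policy these would come from a vision encoder
# and an action autoencoder (assumed here, not implemented).
rng = np.random.default_rng(0)
z_vision = rng.normal(size=LATENT_DIM)
z_action = rng.normal(size=LATENT_DIM)


def oracle_velocity(z, t):
    """Stand-in for a trained velocity network: the exact straight-path velocity."""
    return z_action - z_vision


# The flow starts at the vision latent itself, so no noise sample and
# no separate conditioning input are needed.
z_generated = euler_solve(oracle_velocity, z_vision, num_steps=10)
```

With the oracle velocity, the Euler steps carry `z_vision` onto `z_action` and the loss is zero. A learned velocity network would only approximate this, and the generated latent would then be passed through the action decoder to obtain raw actions.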

Dechen Gao, Boqi Zhao, Andrew Lee, Ian Chuang, Hanchu Zhou, Hang Wang, Zhe Zhao, Junshan Zhang, Iman Soltani • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Robotic Arm Manipulation | MetaWorld Easy | Success Rate: 85 | 15 |
| Robotic Arm Manipulation | MetaWorld Very Hard | Success Rate: 62 | 15 |
| close box | RLBench | Success Rate: 88 | 14 |
| Dexterous Hand Control | Adroit | Overall Avg Success Rate: 77 | 13 |
| Stack Cube | ManiSkill | Success Rate: 80 | 11 |
| Pick Cube | ManiSkill | Success Rate: 88 | 11 |
| Pick-Place Bowl | LIBERO | Success Rate: 92 | 9 |
| open drawer | LIBERO | Success Rate: 90 | 9 |
| Dexterous Hand Manipulation | DexArt | Success Rate: 55 | 6 |
| Robotic Arm Manipulation | MetaWorld Hard split | Success Rate: 48 | 6 |
Showing 10 of 11 rows
