
InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

About

To operate effectively in the real world, robots must integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize textual reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 30.5% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, on which InstructVLA outperforms a fine-tuned OpenVLA by 92% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.

Shuai Yang, Hao Li, Yilun Chen, Bin Wang, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang • 2025

Related benchmarks

Task | Dataset | Metric | Result | Rank
Pick Can | SimplerEnv Google Robot embodiment | Success Rate | 90.8 | 28
Move Near | SimplerEnv Google Robot embodiment | Success Rate | 77.3 | 28
Drawer Opening | SimplerEnv Google Robot embodiment (test) | Success Rate | 60.6 | 28
General Robot Manipulation | SimplerEnv | Average Success Rate | 56 | 23
Stack Blocks | SimplerEnv WidowX Robot embodiment | Success Rate | 20.5 | 13
Put Carrot | SimplerEnv WidowX Robot embodiment | Success Rate | 29.2 | 13
Put Spoon | SimplerEnv WidowX Robot embodiment | Success Rate | 4.58e+3 | 13
Vision-Language-Action | VLA Evaluation Suite | A Score | 0.631 | 10
Robotic Manipulation | SimplerEnv | Sim. Score | 56 | 5
Handover Objects | Self-collected Real-world Data Galaxea R1-lite | Success Rate (O1) | 60 | 2
Showing 10 of 12 rows
