InstructVLA: Vision-Language-Action Instruction Tuning from Understanding to Manipulation

About

To operate effectively in the real world, robots should integrate multimodal reasoning with precise action generation. However, existing vision-language-action (VLA) models often sacrifice one for the other, narrow their abilities to task-specific manipulation data, and suffer catastrophic forgetting of pre-trained vision-language capabilities. To bridge this gap, we introduce InstructVLA, an end-to-end VLA model that preserves the flexible reasoning of large vision-language models (VLMs) while delivering leading manipulation performance with the help of embodied reasoning. InstructVLA introduces a novel training paradigm, Vision-Language-Action Instruction Tuning (VLA-IT), which employs multimodal training with mixture-of-experts adaptation to jointly optimize embodied reasoning and action generation on both standard VLM corpora and a curated 650K-sample VLA-IT dataset. On in-domain SimplerEnv tasks, InstructVLA achieves a 33% improvement over SpatialVLA. To evaluate generalization, we introduce SimplerEnv-Instruct, an 80-task benchmark requiring closed-loop control and high-level instruction understanding, on which InstructVLA outperforms a fine-tuned OpenVLA by 96% and an action expert aided by GPT-4o by 29%. Additionally, InstructVLA surpasses baseline VLMs on multimodal tasks and exhibits inference-time scaling by leveraging textual reasoning to boost manipulation performance in both simulated and real-world settings. These results demonstrate InstructVLA's potential for bridging intuitive and steerable human-robot interaction with efficient policy learning.

Shuai Yang, Hao Li, Bin Wang, Yilun Chen, Yang Tian, Tai Wang, Hanqing Wang, Feng Zhao, Yiyi Liao, Jiangmiao Pang • 2025

Related benchmarks

Task	Dataset	Result
Multimodal Understanding	MMBench	Accuracy: 76.3	847
Multimodal Understanding	MM-Vet	MM-Vet Score: 54	631
Robotic Manipulation	LIBERO	Spatial Success Rate: 97.3	527
Visual Question Answering	ChartQA	Accuracy: 82.9	519
Multimodal Understanding	MMStar	Accuracy: 56.2	407
Visual Question Answering	AI2D	Accuracy: 79.1	317
Visual Question Answering	DocVQA	Accuracy: 86	205
Multimodal Understanding	MMMU (val)	--	199
Visual Question Answering	InfoVQA	Accuracy: 63.7	195
Multimodal Understanding	MME Perception	--	59

Showing 10 of 27 rows
