
ST4VLA: Spatially Guided Training for Vision-Language-Action Models

About

Large vision-language models (VLMs) excel at multimodal understanding but fall short when extended to embodied tasks, where instructions must be transformed into low-level motor actions. We introduce ST4VLA, a dual-system Vision-Language-Action framework that leverages spatially guided training to align action learning with the spatial priors in VLMs. ST4VLA consists of two stages: (i) spatial grounding pre-training, which equips the VLM with transferable priors via scalable point, box, and trajectory prediction on both web-scale and robot-specific data, and (ii) spatially guided action post-training, which encourages the model to produce richer spatial priors that guide action generation via spatial prompting. This design preserves spatial grounding during policy learning and promotes consistent optimization across the spatial and action objectives. Empirically, ST4VLA achieves substantial improvements over a vanilla VLA, with performance increasing from 66.1 to 84.6 on Google Robot and from 54.7 to 73.2 on WidowX Robot, establishing new state-of-the-art results on SimplerEnv. It also demonstrates stronger generalization to unseen objects and paraphrased instructions, as well as robustness to long-horizon perturbations in real-world settings. These results highlight scalable spatially guided training as a promising direction for robust, generalizable robot learning. Source code, data, and models are released at https://internrobotics.github.io/internvla-m1.github.io/
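The two-stage recipe above can be sketched in a few lines of toy Python. This is an illustrative outline only, not the authors' implementation: the class and function names (`ToyVLM`, `stage1_spatial_grounding`, `stage2_spatially_guided_actions`) and the fixed spatial predictions are all hypothetical placeholders standing in for the real model, losses, and optimizer.

```python
# Hypothetical sketch of ST4VLA's two-stage training recipe.
# All names and values here are illustrative, not the released API.

class ToyVLM:
    """Stand-in for the vision-language backbone with a spatial head."""

    def __init__(self):
        self.spatial_head_trained = False

    def predict_spatial(self, image, instruction):
        # Stage-1 targets: points / boxes / trajectories grounding the
        # instruction in the image (fixed dummy values here).
        return {"point": (0.5, 0.5), "box": (0.4, 0.4, 0.6, 0.6)}


def stage1_spatial_grounding(model, web_data, robot_data):
    """Pre-train the VLM on point/box/trajectory prediction."""
    for image, instruction, _target in list(web_data) + list(robot_data):
        _pred = model.predict_spatial(image, instruction)
        # ...compute grounding loss against _target and update model...
    model.spatial_head_trained = True
    return model


def stage2_spatially_guided_actions(model, robot_data):
    """Post-train: spatial predictions prompt the action head.

    The key idea: the action head is conditioned on the model's own
    spatial prior (spatial prompting), so grounding is preserved and
    the spatial and action objectives are optimized consistently.
    """
    actions = []
    for image, instruction, _action_target in robot_data:
        spatial_prior = model.predict_spatial(image, instruction)
        # Toy "action": reach toward the predicted point, then grasp.
        actions.append({"reach": spatial_prior["point"], "grasp": True})
        # ...joint spatial + action loss and parameter update here...
    return actions
```

The sketch only conveys the control flow: grounding supervision first, then action learning conditioned on the model's own spatial outputs.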

Jinhui Ye, Fangjing Wang, Ning Gao, Junqiu Yu, Yangkun Zhu, Bin Wang, Jinyu Zhang, Weiyang Jin, Yanwei Fu, Feng Zheng, Yilun Chen, Jiangmiao Pang • 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | SimplerEnv WidowX Robot tasks (test) | Success Rate (Spoon) | 80.2 | 79 |
| Robot Manipulation | SimplerEnv Google Robot tasks, Visual Matching | Pick Coke Can Success Rate | 97.3 | 62 |
| Robot Manipulation | SimplerEnv Google Robot tasks, Variant Aggregation | Pick Coke Can Success Rate | 95.6 | 44 |
| Robotic Manipulation | LIBERO Franka | Spatial Achievement Rate | 98 | 9 |
| Pick-&-Place | Real-World Unseen Instance | Success Rate | 62 | 6 |
| Pick-&-Place | Real-world Robot Pick-and-place, In-distribution | Success Rate | 92 | 3 |
| Pick-&-Place | Real-world Robot Pick-and-place, Unseen object: Similar distractors | Success Rate | 49 | 3 |
| Pick-&-Place | Real-world Robot Pick-and-place, Unseen object: New background | Success Rate | 63 | 3 |
| Pick-&-Place | Real-world Robot Pick-and-place, Unseen object position | Success Rate | 52 | 3 |
| Pick-&-Place | Real-world Robot Pick-and-place, Unseen object orientation | Success Rate | 72 | 3 |

Showing 10 of 13 rows.
