InternVLA-M1: A Spatially Guided Vision-Language-Action Framework for Generalist Robot Policy

About

We introduce InternVLA-M1, a unified framework for spatial grounding and robot control that advances instruction-following robots toward scalable, general-purpose intelligence. Its core idea is spatially guided vision-language-action training, where spatial grounding serves as the critical link between instructions and robot actions. InternVLA-M1 employs a two-stage pipeline: (i) spatial grounding pre-training on more than 2.3M spatial reasoning data samples to determine "where to act" by aligning instructions with visual, embodiment-agnostic positions, and (ii) spatially guided action post-training to decide "how to act" by generating embodiment-aware actions through plug-and-play spatial prompting. This spatially guided training recipe yields consistent gains: InternVLA-M1 outperforms its variant without spatial guidance by +14.6% on SimplerEnv Google Robot, +17% on WidowX, and +4.3% on LIBERO Franka, while demonstrating stronger spatial reasoning capability in box, point, and trace prediction. To further scale instruction following, we built a simulation engine to collect 244K generalizable pick-and-place episodes, enabling a 6.2% average improvement across 200 tasks and 3K+ objects. In real-world clustered pick-and-place, InternVLA-M1 improved by 7.3%, and with synthetic co-training, achieved +20.6% on unseen objects and novel configurations. Moreover, in long-horizon reasoning-intensive scenarios, it surpassed existing works by over 10%. These results highlight spatially guided training as a unifying principle for scalable and resilient generalist robots. Code and models are available at https://github.com/InternRobotics/InternVLA-M1.

Xinyi Chen, Yilun Chen, Yanwei Fu, Ning Gao, Jiaya Jia, Weiyang Jin, Hao Li, Yao Mu, Jiangmiao Pang, Yu Qiao, Yang Tian, Bin Wang, Bolun Wang, Fangjing Wang, Hanqing Wang, Tai Wang, Ziqin Wang, Xueyuan Wei, Chao Wu, Shuai Yang, Jinhui Ye, Junqiu Yu, Jia Zeng, Jingjing Zhang, Jinyu Zhang, Shi Zhang, Feng Zheng, Bowen Zhou, Yangkun Zhu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 93.8 | 700 |
| Dynamic Manipulation | Domino | Success Rate (SR) | 5.4 | 12 |
| Task 6: Push the cart, grab the grapes, and place on the plate | Real-world | Handle Success Rate | 8 | 8 |
| Task 4: Grab the can, turn and pour onto plate, push the cart forward | Real-world | Grasp Success | 20 | 8 |
| Task 8: Pull out the tray and turn to throw the chip can into the trash | Real-world | Grasp Success Rate | 80 | 8 |
| Task 5: Put toy into basket, walk to human, hand it over | Real-world | Grasp Success Rate | 20 | 8 |
| Task 2: Spray the bowl with water, wipe clean, and fold it up | Real-world | Grasp Success Rate | 0 | 8 |
| Task 1: Remove the lid, turn on the faucet, and fill with water | Real-world | Grasp Success Rate | 0 | 8 |
| Task 3: Pick the bottle, turn around, and pour into cup | Real-world | Grasp Success Rate | 0 | 8 |
| Task 7: Hold the lunch bag and squat down to place on the table | Real-world | Hold Success Rate | 0 | 8 |