VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning

About

Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage training pipeline consisting of (1) an SFT cold-start phase with curated visual chain-of-thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success. Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains across long-horizon robotic tasks. Project page and code: https://cywang735.github.io/VLA-Thinker/
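As background on the second training stage: GRPO (Group Relative Policy Optimization) dispenses with a learned value critic and instead scores each sampled rollout against the mean and standard deviation of its own sampling group. The sketch below shows only that group-relative advantage computation; it is illustrative background on GRPO in general, not the paper's implementation, and the function name and reward values are made up for the example.

```python
import numpy as np

def grpo_advantages(rewards):
    """Group-relative advantages for one group of rollouts.

    Each trajectory's reward is normalized against the mean/std of the
    group it was sampled in, so no value critic is needed.
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + 1e-8)

# Example: four rollouts of the same task with binary success rewards.
adv = grpo_advantages([1.0, 0.0, 0.0, 1.0])
# → approximately [1, -1, -1, 1]: successes pushed up, failures pushed down
```

In a task-success setting like the one the abstract describes, this has the effect of reinforcing whole reasoning-action trajectories that succeed relative to their peers, rather than rewarding individual steps.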

Chaoyang Wang, Wenrui Bao, Sicheng Gao, Bingxin Xu, Yu Tian, Yogesh S. Rawat, Yunhao Ge, Yuzhang Shang• 2026

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot manipulation | LIBERO | Goal Achievement | 95.2 | 700 |
| Robotic manipulation | LIBERO | Spatial Success Rate | 98.7 | 314 |
| Dual-arm manipulation | RoboTwin 2.0, short-horizon tasks (100-130 steps) | Lift Pot Success Rate | 64.8 | 6 |
| Dual-arm manipulation | RoboTwin 2.0, medium-horizon tasks (150-230 steps) | Move Can Pot | 61 | 6 |
| Dual-arm manipulation | RoboTwin 2.0, long & extra-long-horizon tasks (280-650 steps) | Handover Block | 52.8 | 6 |
