VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
About
Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage pipeline consisting of (1) an SFT cold-start phase on curated visual chain-of-thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success (see the sketch below). Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains on long-horizon robotic tasks. Project and code: https://cywang735.github.io/VLA-Thinker/ .
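To make the GRPO stage concrete, below is a minimal sketch of group-relative advantage estimation, the core of GRPO: several reasoning-action trajectories are sampled per task prompt, each rollout is scored by task-level success, and advantages are computed relative to the group's own statistics, so no learned value critic is needed. This is a generic illustration under stated assumptions, not the paper's implementation; the binary success reward, group size, and names like `grpo_advantages` are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages as in GRPO: normalize each trajectory's
    task-level reward against the mean and std of its sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical usage: sample G reasoning-action rollouts for one task prompt,
# score each by task success (1.0 = success, 0.0 = failure), then weight the
# policy-gradient update for every token in a rollout by its advantage.
group_rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
advantages = grpo_advantages(group_rewards)
print(advantages)  # successful rollouts get positive weight, failures negative
```

Because the advantage is computed within each sampled group, this setup rewards whole reasoning-action trajectories that end in task success rather than individual intermediate steps, which matches the paper's stated goal of aligning complete trajectories with task-level outcomes.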
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 95.2 | 700 |
| Robotic Manipulation | LIBERO | Spatial Success Rate | 98.7 | 314 |
| Dual-arm manipulation | RoboTwin 2.0 Short Horizon Tasks (100-130 steps) | Lift Pot Success Rate | 64.8 | 6 |
| Dual-arm manipulation | RoboTwin 2.0 Medium Horizon Tasks (150-230 steps) | Move Can Pot | 61 | 6 |
| Dual-arm manipulation | RoboTwin 2.0 Long & Extra Long Horizon Tasks (280-650 steps) | Handover Block | 52.8 | 6 |