VLA-Thinker: Boosting Vision-Language-Action Models through Thinking-with-Image Reasoning
About
Vision-Language-Action (VLA) models have shown promising capabilities for embodied intelligence, but most existing approaches rely on text-based chain-of-thought reasoning in which visual inputs are treated as static context. This limits the model's ability to actively revisit the environment and resolve ambiguities during long-horizon tasks. We propose VLA-Thinker, a thinking-with-image reasoning framework that models perception as a dynamically invocable reasoning action. To train such a system, we introduce a two-stage pipeline consisting of (1) an SFT cold-start phase on curated visual chain-of-thought data to activate structured reasoning and tool-use behaviors, and (2) GRPO-based reinforcement learning to align complete reasoning-action trajectories with task-level success (see the sketch below). Extensive experiments on the LIBERO and RoboTwin 2.0 benchmarks demonstrate that VLA-Thinker significantly improves manipulation performance, achieving a 97.5% success rate on LIBERO and strong gains on long-horizon robotic tasks. Project and code: https://cywang735.github.io/VLA-Thinker/ .
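To make the GRPO stage concrete, below is a minimal sketch of group-relative advantage estimation, the core of GRPO: several reasoning-action trajectories are sampled per task prompt, each rollout is scored by task-level success, and advantages are computed relative to the group's own statistics, so no learned value critic is needed. This is a generic illustration under stated assumptions, not the paper's implementation; the binary success reward, group size, and names like `grpo_advantages` are hypothetical.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Group-relative advantages as in GRPO: normalize each trajectory's
    task-level reward against the mean and std of its sampled group."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Hypothetical usage: sample G reasoning-action rollouts for one task prompt,
# score each by task success (1.0 = success, 0.0 = failure), then weight the
# policy-gradient update for every token in a rollout by its advantage.
group_rewards = np.array([1.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0, 0.0])
advantages = grpo_advantages(group_rewards)
print(advantages)  # successful rollouts get positive weight, failures negative
```

Because the advantage is computed within each sampled group, this setup rewards whole reasoning-action trajectories that end in task success rather than individual intermediate steps, which matches the paper's stated goal of aligning complete trajectories with task-level outcomes.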
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 95.2 | 700 |
| Robotic Manipulation | LIBERO | Spatial Success Rate | 98.7 | 314 |
| Dual-arm manipulation | RoboTwin 2.0 Short Horizon Tasks (100-130 steps) | Lift Pot Success Rate | 64.8 | 6 |
| Dual-arm manipulation | RoboTwin 2.0 Medium Horizon Tasks (150-230 steps) | Move Can Pot | 61 | 6 |
| Dual-arm manipulation | RoboTwin 2.0 Long & Extra Long Horizon Tasks (280-650 steps) | Handover Block | 52.8 | 6 |