VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies
About
Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, and autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VisualThink-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our guiding principle is to steer action with effective visual thinking: VisualThink-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VisualThink-VLA uses a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions used for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VisualThink-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
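The latency figures quoted for BridgeData V2 can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the implied control rate assumes one policy call per control step with no other overhead, which the abstract does not state.

```python
# Step latencies on BridgeData V2, in seconds, as quoted in the abstract.
ECOT_LATENCY_S = 8.377
VISUALTHINK_LATENCY_S = 0.367


def speedup(baseline_s: float, new_s: float) -> float:
    """Ratio of baseline latency to new latency."""
    return baseline_s / new_s


def max_control_rate_hz(step_latency_s: float) -> float:
    """Upper bound on closed-loop rate, ASSUMING one policy call per step."""
    return 1.0 / step_latency_s


ratio = speedup(ECOT_LATENCY_S, VISUALTHINK_LATENCY_S)
ecot_rate = max_control_rate_hz(ECOT_LATENCY_S)
visualthink_rate = max_control_rate_hz(VISUALTHINK_LATENCY_S)

print(f"speedup: {ratio:.1f}x")
print(f"ECoT rate bound: {ecot_rate:.2f} Hz")
print(f"VisualThink-VLA rate bound: {visualthink_rate:.2f} Hz")
```

Under that assumption, the sub-second latency corresponds to a few policy steps per second, whereas the multi-second baseline manages well under one step per second.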
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Robotic Manipulation | LIBERO Long | Success Rate | 95.87 | 91 |
| Robotic Manipulation | BridgeData V2 | Success Rate | 89.49 | 8 |
| Robotic Manipulation | Fractal | Success Rate | 90.82 | 8 |
| Robotic Manipulation | UT Austin MUTEX | Success Rate (%) | 77.26 | 8 |
| Robotic Manipulation | RoboTurk | Success Rate | 96.1 | 8 |
| Robotic Manipulation | LIBERO Spatial | Success Rate | 96.69 | 7 |
| Robotic Manipulation | LIBERO Goal | Success Rate | 97.05 | 7 |
| Multi-object pick-place | Real-robot tabletop suite (closed-loop evaluation) | Success Rate | 75.6 | 3 |
| Relation-sensitive placement | Real-robot tabletop suite (closed-loop evaluation) | Success Rate | 67.2 | 3 |
| Two-stage compositional task | Real-robot tabletop suite (closed-loop evaluation) | Success Rate | 59.2 | 3 |