VisualThink-VLA: Visual Intermediate Reasoning for Effective and Low-Latency Vision-Language-Action Policies

About

Recent work has begun to equip vision-language-action (VLA) policies with explicit intermediate reasoning. In embodied control, however, textual chain-of-thought is a poor fit: irrelevant or weakly textual information can interfere with action prediction, and autoregressive text decoding adds too much latency for real-time closed-loop execution. We present VisualThink-VLA, a visual intermediate-reasoning framework for accurate, low-latency VLA policies. Our bootstrapping philosophy is to guide action with effective visual thinking: VisualThink-VLA bootstraps action prediction through a compact visual-evidence interface that preserves spatial precision while avoiding decoding overhead. To further improve performance and efficiency, VisualThink-VLA adopts a tailored selective routing mechanism to learn the visual evidence tokens, enabling low-latency inference while preserving high-capacity specialization. We also introduce VisualEvidence-Kit, a supervision-and-audit resource centered on a VisualEvidence-Agent that constructs VisualEvidence-Set, a collection of 754.7k VLA instructions, for route supervision and counterfactual faithfulness tests. Across multiple benchmarks and real-robot evaluation, VisualThink-VLA achieves the highest success rate on most benchmarks while reducing the multi-second latency of reasoning-augmented baselines to the sub-second regime. For example, on BridgeData V2, it reduces step latency from 8.377 s with ECoT to 0.367 s, a 22.8× speedup.
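The speedup quoted for BridgeData V2 is the ratio of the two step latencies. The short Python sketch below only reproduces that arithmetic from the figures given in the abstract; the function name `speedup` and the derived steps-per-second values are illustrative assumptions, not part of the paper's code or reported results.

```python
def speedup(baseline_s: float, new_s: float) -> float:
    """Return how many times faster the new latency is than the baseline."""
    if baseline_s <= 0 or new_s <= 0:
        raise ValueError("latencies must be positive")
    return baseline_s / new_s


# Step latencies quoted in the abstract for BridgeData V2 (seconds per step).
ecot_latency_s = 8.377
visualthink_latency_s = 0.367

ratio = speedup(ecot_latency_s, visualthink_latency_s)

# Derived for illustration only (not reported in the source):
# how many policy steps per second each latency would allow.
ecot_steps_per_s = 1.0 / ecot_latency_s
visualthink_steps_per_s = 1.0 / visualthink_latency_s

print(f"{ratio:.1f}x")  # prints "22.8x"
```

Under these figures the faster policy completes more than two steps per second, whereas the baseline needs several seconds per step, which is what the abstract means by moving from the multi-second to the sub-second regime.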

Mingjian Gao, Wenqiao Zhang, Yuqian Yuan, Yang Dai, Binhe Yu, Zheqi Lv, Haoyu Zheng, Jiaqi Zhu, Zhiqi Ge, Zixuan Wan, Siliang Tang, Yueting Zhuang • 2026

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	LIBERO Long	Success Rate: 95.87	97
Robotic Manipulation	BridgeData V2	Success Rate: 89.49	13
Robotic Manipulation	Fractal	Success Rate: 90.82	13
Robotic Manipulation	UT Austin MUTEX	Success Rate (%): 77.26	8
Robotic Manipulation	RoboTurk	Success Rate: 96.1	8
Robotic Manipulation	LIBERO Spatial	Success Rate: 96.69	7
Robotic Manipulation	LIBERO Goal	Success Rate: 97.05	7
Multi-object pick-place	Real-robot tabletop suite (closed-loop evaluation)	Success Rate: 75.6	3
Relation-sensitive placement	Real-robot tabletop suite (closed-loop evaluation)	Success Rate: 67.2	3
Two-stage compositional task	Real-robot tabletop suite (closed-loop evaluation)	Success Rate: 59.2	3

Showing 10 of 12 rows
