
DeepThinkVLA: Enhancing Reasoning Capability of Vision-Language-Action Models

About

Enabling Vision-Language-Action (VLA) models to "think before acting" via Chain-of-Thought (CoT) is a promising path to overcoming the data-hungry nature of end-to-end robot policies. However, progress is stalled by a fundamental conflict: existing models use a single autoregressive decoder for both sequential CoT reasoning and high-dimensional, parallelizable robot actions. This architectural mismatch degrades motor control and fails to forge a strong causal link between thought and action. We introduce DeepThinkVLA, which resolves this conflict through a tightly integrated architecture and training strategy. Architecturally, our hybrid-attention decoder generates sequential CoT with causal attention and then switches to bidirectional attention for fast, parallel decoding of action vectors. This design is complemented by a two-stage training pipeline: we first use Supervised Fine-Tuning (SFT) to teach the model foundational reasoning, then apply Reinforcement Learning (RL) with task-success rewards to causally align the full reasoning-action sequence with desired outcomes. This synergy leads to state-of-the-art performance, achieving a 97.0% success rate on the LIBERO benchmark. Our ablations confirm the design's effectiveness: the hybrid architecture alone outperforms standard decoders by 15.5%, and the final RL stage provides a crucial 2% boost to secure top performance.
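The hybrid-attention idea described above can be illustrated with an attention mask. The following is a minimal sketch, not the authors' implementation: it assumes the sequence is laid out as reasoning (CoT) tokens followed by action tokens, that action tokens may attend to every earlier token, and that the function name `hybrid_attention_mask` and its parameters are hypothetical names chosen for illustration.

```python
import numpy as np


def hybrid_attention_mask(num_reasoning: int, num_action: int) -> np.ndarray:
    """Build a boolean mask where mask[i, j] is True if position i may attend to j.

    Assumed layout (not specified in the abstract): reasoning tokens come first,
    action tokens come last.
    """
    total = num_reasoning + num_action
    # Start from a causal (lower-triangular) mask over the whole sequence,
    # so reasoning tokens only see themselves and earlier tokens.
    mask = np.tril(np.ones((total, total), dtype=bool))
    # Within the action block, allow full bidirectional attention so that
    # all action tokens can be decoded in parallel.
    mask[num_reasoning:, num_reasoning:] = True
    return mask


if __name__ == "__main__":
    # 3 reasoning tokens followed by 2 action tokens.
    print(hybrid_attention_mask(3, 2).astype(int))
```

With 3 reasoning tokens and 2 action tokens, the first three rows are lower-triangular (causal), while the last two rows are fully populated: each action token sees all reasoning tokens and both action tokens.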

Cheng Yin, Yankai Lin, Wang Xu, Sikyuen Tam, Xiangrui Zeng, Zhiyuan Liu, Zhouping Yin • 2025

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Robotic Manipulation | LIBERO | Spatial Success Rate: 96.6 | 314 |
| Robot Manipulation | LIBERO (test) | Average Success Rate: 97 | 184 |
| Dual-arm manipulation | RoboTwin Short Horizon Tasks 100-130 Steps 2.0 | Lift Pot Success Rate: 62 | 6 |
| Dual-arm manipulation | RoboTwin Medium Horizon Tasks 150-230 Steps 2.0 | Move Can Pot: 52 | 6 |
| Dual-arm manipulation | RoboTwin Long & Extra Long Horizon Tasks 280-650 Steps 2.0 | Handover Block: 43 | 6 |
