
CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

About

Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by autoregressively predicting future image frames as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin • 2025
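
The abstract describes a two-stage loop: first predict a future image as a visual goal, then generate a short action sequence to reach it. The sketch below illustrates only that control flow, under loud assumptions: the names `VisualCoTPolicy`, `predict_subgoal`, `predict_actions`, and `run_episode` are hypothetical, the "images" are one-element lists standing in for image tokens, and the toy functions replace the 7B model entirely. It is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Image = List[float]   # stand-in for an image (real systems use image tokens)
Action = List[float]  # stand-in for one low-level robot action


@dataclass
class VisualCoTPolicy:
    """Hypothetical two-stage policy: imagine a subgoal image, then act."""
    predict_subgoal: Callable[[Image, str], Image]
    predict_actions: Callable[[Image, Image, str], List[Action]]

    def step(self, observation: Image, instruction: str) -> List[Action]:
        # Stage 1: the visual chain-of-thought step, a predicted future frame.
        subgoal = self.predict_subgoal(observation, instruction)
        # Stage 2: a short action chunk conditioned on that predicted frame.
        return self.predict_actions(observation, subgoal, instruction)


def run_episode(
    policy: VisualCoTPolicy,
    initial_observation: Image,
    apply_action: Callable[[Image, Action], Image],
    instruction: str,
    max_chunks: int,
) -> Tuple[List[Action], Image]:
    """Alternate between predicting a subgoal and executing an action chunk."""
    observation = initial_observation
    executed: List[Action] = []
    for _ in range(max_chunks):
        for action in policy.step(observation, instruction):
            observation = apply_action(observation, action)
            executed.append(action)
    return executed, observation


# Toy 1-D world (assumption): the "image" is just the gripper position.
TARGET = 8.0

def toy_subgoal(obs: Image, instruction: str) -> Image:
    # Imagine a frame halfway between the current position and the target.
    return [obs[0] + (TARGET - obs[0]) / 2]

def toy_actions(obs: Image, goal: Image, instruction: str) -> List[Action]:
    # Reach the imagined frame in two equal displacement steps.
    delta = goal[0] - obs[0]
    return [[delta / 2], [delta / 2]]

def toy_apply(obs: Image, action: Action) -> Image:
    return [obs[0] + action[0]]


policy = VisualCoTPolicy(toy_subgoal, toy_actions)
actions, final_obs = run_episode(policy, [0.0], toy_apply, "move right", max_chunks=3)
print(len(actions), final_obs)
```

In this toy setting each chunk closes half of the remaining distance to the target, so the position after three chunks is 7.0. In a real VLA both stages would be produced by the same autoregressive model, per the abstract's description of a model that generates visual and action tokens.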

Related benchmarks

Task                           Dataset                      Metric                    Result   Rank
Robot Manipulation             LIBERO                       Goal Achievement          87.6     494
Robot Manipulation             LIBERO (test)                Average Success Rate      83.9     142
Robotic Manipulation           LIBERO 1.0 (test)            Long                      87.6     30
Robotic Manipulation           LIBERO v1 (test)             Config 10 Score           87.6     27
Multi-task Learning            LIBERO                       Object Score              91.6     18
Robot Policy Learning          LIBERO                       S (Spatial) Rate          87.5     16
Robotic Manipulation           LIBERO 50 rollouts per task  Spatial Success           0.875    10
Robotic Manipulation           LIBERO Franka                Spatial Achievement Rate  87.5     9
Vision-Language Navigation     LH-VLN (test)                SR                        0.00e+0  8
Multi-task Robot Manipulation  RLBench                      Close box                 95       7
Showing 10 of 21 rows
