CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models

About

Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings, lacking the intermediate reasoning steps crucial for complex manipulation tasks. As a result, existing VLAs lack temporal planning or reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by predicting future image frames autoregressively as visual goals before generating a short action sequence to achieve these goals. We introduce CoT-VLA, a state-of-the-art 7B VLA that can understand and generate visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% in real-world manipulation tasks and 6% in simulation benchmarks. Project website: https://cot-vla.github.io/
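
The two-stage idea in the abstract (first predict future-frame tokens as a visual goal, then generate a short action sequence toward that goal) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the integer token representation, and the stand-in models `toy_next_token` and `toy_decode_actions` are all hypothetical, chosen only to make the control flow concrete.

```python
from typing import Callable, List, Sequence, Tuple

Tokens = List[int]
Action = List[float]


def generate_autoregressively(
    next_token: Callable[[Sequence[int]], int],
    prefix: Sequence[int],
    n_new: int,
) -> Tokens:
    """Produce n_new tokens one at a time, each conditioned on all earlier tokens."""
    seq = list(prefix)
    for _ in range(n_new):
        seq.append(next_token(seq))
    return seq[len(prefix):]


def cot_step(
    next_token: Callable[[Sequence[int]], int],
    decode_actions: Callable[[Sequence[int], int], List[Action]],
    instruction: Sequence[int],
    observation: Sequence[int],
    n_goal_tokens: int,
    chunk_len: int,
) -> Tuple[Tokens, List[Action]]:
    """One reasoning step: visual goal first, then a short action chunk."""
    prefix = list(instruction) + list(observation)
    # Stage 1 (visual chain of thought): predict tokens of a future frame.
    goal = generate_autoregressively(next_token, prefix, n_goal_tokens)
    # Stage 2: generate actions conditioned on the context plus the predicted goal.
    actions = decode_actions(prefix + goal, chunk_len)
    return goal, actions


# Hypothetical stand-ins for a trained model, used only to run the sketch.
def toy_next_token(seq: Sequence[int]) -> int:
    return (sum(seq) + len(seq)) % 16


def toy_decode_actions(context: Sequence[int], chunk_len: int) -> List[Action]:
    return [[(context[-1] + i) / 16.0] for i in range(chunk_len)]


if __name__ == "__main__":
    goal, actions = cot_step(
        toy_next_token, toy_decode_actions,
        instruction=[1, 2], observation=[3, 4],
        n_goal_tokens=3, chunk_len=2,
    )
    print("visual goal tokens:", goal)
    print("action chunk:", actions)
```

In a closed-loop setting, a step like `cot_step` would be repeated after the action chunk has been executed and a new observation is available; how the real model tokenizes images and decodes actions is not specified in the abstract and is left abstract here.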

Qingqing Zhao, Yao Lu, Moo Jin Kim, Zipeng Fu, Zhuoyang Zhang, Yecheng Wu, Zhaoshuo Li, Qianli Ma, Song Han, Chelsea Finn, Ankur Handa, Ming-Yu Liu, Donglai Xiang, Gordon Wetzstein, Tsung-Yi Lin • 2025

Related benchmarks

Task                               Dataset            Metric                Result  Rank
Robot Manipulation                 LIBERO             Goal Achievement      87.6    700
Robotic Manipulation               LIBERO             Spatial Success Rate  94.2    314
Robot Manipulation                 LIBERO (test)      Average Success Rate  83.9    184
Robot Policy Learning              LIBERO             S (Spatial) Rate      87.5    65
Robotic Manipulation               LIBERO v1 (test)   Average Success Rate  81.1    46
Robotic Manipulation               LIBERO (test)      Object Success Rate   91.6    45
Robotic Manipulation               LIBERO 1.0 (test)  Long                  87.6    40
Robot Manipulation                 LIBERO simulation  Average Success Rate  81.1    36
Language-conditioned manipulation  LIBERO             Spatial Success Rate  87.5    18
Multi-task Learning                LIBERO             Object Score          91.6    18
Showing 10 of 26 rows
