CoT-VLA: Visual Chain-of-Thought Reasoning for Vision-Language-Action Models
About
Vision-language-action models (VLAs) have shown potential in leveraging pretrained vision-language models and diverse robot demonstrations for learning generalizable sensorimotor control. While this paradigm effectively utilizes large-scale data from both robotic and non-robotic sources, current VLAs primarily focus on direct input-output mappings and lack the intermediate reasoning steps crucial for complex manipulation tasks; as a result, they lack temporal planning and reasoning capabilities. In this paper, we introduce a method that incorporates explicit visual chain-of-thought (CoT) reasoning into VLAs by autoregressively predicting future image frames as visual goals before generating a short action sequence to achieve those goals. The resulting model, CoT-VLA, is a state-of-the-art 7B VLA that can understand and generate both visual and action tokens. Our experimental results demonstrate that CoT-VLA achieves strong performance, outperforming the state-of-the-art VLA model by 17% on real-world manipulation tasks and by 6% on simulation benchmarks. Project website: https://cot-vla.github.io/
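To make the two-stage decoding pattern described above concrete, here is a minimal sketch of one control step: the model first autoregressively generates the tokens of a future image frame (the visual subgoal), then decodes a short action chunk conditioned on that subgoal. This is an illustration under assumptions, not the authors' implementation; `generate`, `Token`, and the token-space argument are hypothetical stand-ins for the model's unified multimodal decoder.

```python
from typing import Callable, List

Token = int  # placeholder: text, image, and action tokens share one vocabulary


def visual_cot_step(
    generate: Callable[[List[Token], int, str], List[Token]],
    text_tokens: List[Token],
    image_tokens: List[Token],
    num_goal_tokens: int = 256,
    action_horizon: int = 8,
) -> List[Token]:
    """One control step of visual chain-of-thought inference.

    `generate(prefix, n, space)` is a hypothetical wrapper around the
    autoregressive model that samples `n` tokens from the given token
    space ("image" or "action") conditioned on `prefix`.
    """
    # Prompt: the task instruction followed by the tokenized current observation.
    prompt = text_tokens + image_tokens

    # Visual chain of thought: autoregressively predict a future frame,
    # which serves as an explicit visual goal.
    goal_tokens = generate(prompt, num_goal_tokens, "image")

    # Decode a short action sequence conditioned on both the observation
    # and the predicted subgoal, then hand it to the low-level controller.
    action_tokens = generate(prompt + goal_tokens, action_horizon, "action")
    return action_tokens
```

At each step the policy in effect "thinks in images" before acting: the predicted frame constrains the action decoder much as a textual chain of thought constrains a language model's final answer.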
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Robot Manipulation | LIBERO | Goal Achievement | 87.6 | 494 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 83.9 | 142 |
| Robotic Manipulation | LIBERO 1.0 (test) | Long | 87.6 | 30 |
| Robotic Manipulation | LIBERO v1 (test) | Config 10 Score | 87.6 | 27 |
| Multi-task Learning | LIBERO | Object Score | 91.6 | 18 |
| Robot Policy Learning | LIBERO | S (Spatial) Rate | 87.5 | 16 |
| Robotic Manipulation | LIBERO (50 rollouts per task) | Spatial Success | 0.875 | 10 |
| Robotic Manipulation | LIBERO Franka | Spatial Achievement Rate | 87.5 | 9 |
| Vision-Language Navigation | LH-VLN (test) | SR | 0.00e+0 | 8 |
| Multi-task Robot Manipulation | RLBench | Close box | 95 | 7 |