Fast-ThinkAct: Efficient Vision-Language-Action Reasoning via Verbalizable Latent Planning
About
Vision-Language-Action (VLA) tasks require reasoning over complex visual scenes and executing adaptive actions in dynamic environments. While recent studies on reasoning VLAs show that explicit chain-of-thought (CoT) can improve generalization, they suffer from high inference latency due to lengthy reasoning traces. We propose Fast-ThinkAct, an efficient reasoning framework that achieves compact yet performant planning through verbalizable latent reasoning. Fast-ThinkAct learns to reason efficiently with latent CoTs by distilling from a teacher. The distillation is driven by a preference-guided objective that aligns manipulation trajectories and transfers both linguistic and visual planning capabilities for embodied control. This enables reasoning-enhanced policy learning that effectively connects compact reasoning to action execution. Extensive experiments across diverse embodied manipulation and reasoning benchmarks demonstrate that Fast-ThinkAct achieves strong performance with up to 89.3% lower inference latency than state-of-the-art reasoning VLAs, while maintaining effective long-horizon planning, few-shot adaptation, and failure recovery.
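To put the reported latency figure in perspective, a reduction of up to 89.3% can be converted into a speedup factor. The sketch below only does that arithmetic; the helper names and the 10-second baseline are hypothetical illustrations, not measurements from the paper, and 89.3% is the reported upper bound rather than a typical value.

```python
def speedup_from_reduction(reduction_pct: float) -> float:
    """Convert a percentage latency reduction into a speedup factor."""
    if not 0.0 <= reduction_pct < 100.0:
        raise ValueError("reduction_pct must be in [0, 100)")
    return 1.0 / (1.0 - reduction_pct / 100.0)


def reduced_latency(baseline_s: float, reduction_pct: float) -> float:
    """Latency remaining after applying a percentage reduction."""
    return baseline_s * (1.0 - reduction_pct / 100.0)


# 89.3% is the "up to" figure quoted in the abstract.
speedup = speedup_from_reduction(89.3)
print(f"{speedup:.2f}x")  # prints "9.35x"

# Hypothetical baseline of 10 s per inference, chosen only for illustration.
print(f"{reduced_latency(10.0, 89.3):.2f} s")  # prints "1.07 s"
```

In other words, at the best reported case the model needs roughly one ninth of the baseline's inference time.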
Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Robot Manipulation | LIBERO | -- | 494 |
| Robot Manipulation | LIBERO (test) | Average Success Rate: 89.7 | 142 |
| Egocentric daily-task planning | EgoPlanBench2 | Overall Success Rate: 47.5 | 26 |
| Embodied Question Answering | RoboVQA | BLEU-1: 70.4 | 13 |
| Long-horizon reasoning for robotic manipulation | RoboVQA | B-1 Score: 70.1 | 10 |
| Zero-shot understanding of embodied scenes | OpenEQA | Score: 51.2 | 10 |
| Embodied Question Answering | OpenEQA | Score: 59 | 8 |
| Bimanual Robot Manipulation | RoboTwin Easy Setting 2.0 (test) | Click Alarm: 70 | 6 |
| Bimanual Robot Manipulation | RoboTwin Hard Setting 2.0 (test) | Click Alarm Success Count: 17 | 6 |
| Robot Manipulation | SimplerEnv Google | Success Rate: 68.7 | 5 |