
VLA-Cache: Efficient Vision-Language-Action Manipulation via Adaptive Token Caching

About

Vision-Language-Action (VLA) models have demonstrated strong multi-modal reasoning capabilities, enabling direct action generation from visual perception and language instructions in an end-to-end manner. However, their substantial computational cost poses a challenge for real-time robotic control, where rapid decision-making is essential. This paper introduces VLA-Cache, a training-free inference acceleration method that reduces computational overhead by adaptively caching and reusing static visual tokens across frames. Exploiting the temporal continuity of robotic manipulation, VLA-Cache identifies minimally changed tokens between adjacent frames and reuses their cached key-value representations, thereby avoiding redundant computation. Additionally, to maintain action precision, VLA-Cache selectively re-computes task-relevant tokens that are environmentally sensitive, preserving the fidelity of critical visual information. To further optimize efficiency, we introduce a layer-adaptive token-reuse strategy that dynamically adjusts the reuse ratio based on attention concentration across decoder layers, prioritizing critical tokens for recomputation. Extensive experiments on two simulation platforms (LIBERO and SIMPLER) and a real-world robotic system demonstrate that VLA-Cache achieves up to a 1.7x speedup in CUDA latency and a 15% increase in control frequency, with negligible loss in task success rate. The code and videos can be found at our project page: https://vla-cache.github.io.
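The two core ideas in the abstract, ranking tokens by inter-frame change to pick which key-value entries to reuse, and gauging attention concentration to set a per-layer reuse ratio, can be sketched as follows. This is a minimal illustration, not the paper's implementation; the thresholding scheme, the L2 change metric, and the entropy-based concentration measure are all assumptions made for the example.

```python
import numpy as np

def select_reusable_tokens(prev_tokens, curr_tokens, reuse_ratio=0.5):
    """Mark the least-changed visual tokens between two frames for KV-cache reuse.

    prev_tokens, curr_tokens: (num_tokens, dim) patch embeddings of
    adjacent frames. Returns a boolean mask; True = reuse cached KV,
    False = recompute (the token changed too much).
    """
    # Per-token change: L2 distance between consecutive frame embeddings.
    change = np.linalg.norm(curr_tokens - prev_tokens, axis=-1)
    n_reuse = int(reuse_ratio * len(change))
    # Indices of the most static tokens -> candidates for cache reuse.
    reuse_idx = np.argsort(change)[:n_reuse]
    mask = np.zeros(len(change), dtype=bool)
    mask[reuse_idx] = True
    return mask

def attention_concentration(attn_weights):
    """Entropy-based concentration score in [0, 1] for one decoder layer.

    attn_weights: (num_queries, num_keys) non-negative attention weights.
    Near 0 = attention spread uniformly; near 1 = focused on few tokens.
    A layer-adaptive scheme could lower the reuse ratio where this score
    is high, so concentrated (critical) tokens get recomputed.
    """
    p = attn_weights / attn_weights.sum(axis=-1, keepdims=True)
    entropy = -(p * np.log(p + 1e-9)).sum(axis=-1).mean()
    max_entropy = np.log(attn_weights.shape[-1])
    return 1.0 - entropy / max_entropy
```

As a usage sketch: between two frames where only one patch embedding moves, `select_reusable_tokens` leaves that token out of the reuse mask so only it is recomputed, while all static tokens keep their cached key-value pairs.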

Siyu Xu, Yunke Wang, Chenghao Xia, Dihao Zhu, Tao Huang, Chang Xu • 2025

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Robot Manipulation | LIBERO | Goal Achievement | 97.4 | 700 |
| Robot Manipulation | LIBERO (test) | Average Success Rate | 74.7 | 184 |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Average Success Rate | 62.33 | 67 |
| Robot Manipulation | SimplerEnv Google Robot tasks Visual Matching | Pick Coke Can Success Rate | 92 | 62 |
| Robotic Manipulation | SIMPLER Google Robot VA | Pick Up Coke Can Success Rate | 91.7 | 35 |
| Robot Manipulation | LIBERO | Spatial Success Rate | 83.8 | 30 |
| Robotic Manipulation | SIMPLER Visual Matching | Average Success Rate | 74.4 | 26 |
| Robot Manipulation | LIBERO OpenVLA-OFT | LIBERO Spatial Success | 0.966 | 21 |
| Robotic Manipulation | Real-world robotic manipulation | Average Success Rate | 30 | 6 |
| Robotic Manipulation | ARX5 Real-World | Task 1 Success Rate | 75 | 3 |
