Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

About

Vision-Language-Action (VLA) models excel at robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and demonstrates strong generalization across diverse decoders and benchmarks.
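To make the idea concrete, here is a minimal sketch of what "balancing attention magnitude with inter-layer ranking consistency" could look like. The paper's exact formulation is not given in this abstract; the Spearman-style rank-stability measure, the linear mixing weight `alpha`, and the function name `ties_token_selection` below are all illustrative assumptions, not the authors' implementation.

```python
import torch

def ties_token_selection(attn_per_layer, keep_ratio=0.22, alpha=0.5):
    """Hedged sketch of consistency-aware visual token selection.

    attn_per_layer: (L, N) tensor -- attention mass each of the N visual
    tokens receives in each of the L layers (e.g., averaged over heads).
    alpha trades off magnitude vs. rank consistency; this combination
    rule is an assumption, not the paper's exact method.
    """
    L, N = attn_per_layer.shape

    # Attention magnitude: mean attention a token receives across layers.
    magnitude = attn_per_layer.mean(dim=0)  # (N,)

    # Inter-layer ranking consistency: rank tokens within each layer,
    # then penalize tokens whose rank fluctuates across layers.
    ranks = attn_per_layer.argsort(dim=1, descending=True).argsort(dim=1).float()  # (L, N)
    consistency = 1.0 / (1.0 + ranks.std(dim=0))  # (N,), high = stable rank

    # Normalize both signals to [0, 1] before mixing.
    def norm01(x):
        return (x - x.min()) / (x.max() - x.min() + 1e-8)

    score = alpha * norm01(magnitude) + (1 - alpha) * norm01(consistency)

    # Keep the top-k tokens by combined score.
    k = max(1, int(keep_ratio * N))
    return score.topk(k).indices
```

With `keep_ratio=0.22`, this sketch would correspond to the 78% token reduction the abstract reports; the "Tau-guided" dynamic framework presumably adapts the balance between the two signals per task rather than fixing `alpha` as done here.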

Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang • 2026

Related benchmarks

| Task | Dataset | Result | Rank |
| --- | --- | --- | --- |
| Robotic Manipulation | LIBERO | - | 314 |
| Robot Manipulation | SimplerEnv Google Robot tasks Variant Aggregation | Average Success Rate: 39.44 | 67 |
| Robotic Manipulation | SIMPLER Google Robot VA | Pick Up Coke Can Success Rate: 89.7 | 35 |
| Robotic Manipulation | SIMPLER Visual Matching | Average Success Rate: 78.1 | 26 |
