Beyond Attention Magnitude: Leveraging Inter-layer Rank Consistency for Efficient Vision-Language-Action Models

About

Vision-Language-Action (VLA) models excel in robotic manipulation but suffer from significant inference latency due to processing dense visual tokens. Existing token reduction methods predominantly rely on attention magnitude as a static selection criterion. In this work, we challenge this assumption, revealing that high-attention tokens are task-dependent and can even degrade policy performance. To address this, we introduce TIES (Tau-guided Inter-layer Efficient Selection), a dynamic framework guided by inter-layer token ranking consistency. By adaptively balancing attention magnitude with ranking consistency, TIES ensures robust token selection without requiring additional training. On the CogACT + SIMPLER benchmark, TIES improves average success rates by 6% while reducing token usage by 78%, and it demonstrates strong generalization across diverse decoders and benchmarks.
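
The idea of balancing attention magnitude against inter-layer ranking consistency can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the scoring rule, so the blend below (a Kendall tau between two layers' attention scores used as a weight, plus a per-token rank-stability term) is an assumption made purely for illustration, and the names `kendall_tau`, `ranks`, and `select_tokens` are hypothetical.

```python
import numpy as np


def kendall_tau(a, b):
    """Kendall tau-a between two score vectors, via pairwise sign agreement."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    n = a.size  # assumed to be at least 2
    da = np.sign(a[:, None] - a[None, :])
    db = np.sign(b[:, None] - b[None, :])
    # Every unordered pair appears twice in the full matrix, hence n * (n - 1).
    return float((da * db).sum() / (n * (n - 1)))


def ranks(x):
    """Rank of each entry, 0 for the smallest score (ties broken by index)."""
    return np.argsort(np.argsort(x))


def select_tokens(attn_prev, attn_curr, keep):
    """Pick `keep` token indices from per-token attention in two layers.

    Hypothetical rule (an assumption, not taken from the paper): when the two
    layers rank tokens consistently (tau near 1), trust attention magnitude;
    when they disagree, lean on how stable each token's rank is instead.
    """
    attn_prev = np.asarray(attn_prev, dtype=float)
    attn_curr = np.asarray(attn_curr, dtype=float)
    n = attn_curr.size
    tau = kendall_tau(attn_prev, attn_curr)
    w = max(tau, 0.0)  # negative agreement is treated as no agreement
    magnitude = ranks(attn_curr) / (n - 1)
    stability = 1.0 - np.abs(ranks(attn_prev) - ranks(attn_curr)) / (n - 1)
    score = w * magnitude + (1.0 - w) * stability
    kept = np.argsort(-score, kind="stable")[:keep]
    return np.sort(kept), tau


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    prev_layer = rng.random(16)  # stand-in attention over 16 visual tokens
    curr_layer = prev_layer + 0.1 * rng.random(16)
    kept, tau = select_tokens(prev_layer, curr_layer, keep=4)
    print("kept token indices:", kept, "tau:", round(tau, 3))
```

Because the sketch only post-processes attention scores, it needs no training, which matches the training-free framing of the abstract; the specific weighting is a design choice that a real system would have to tune.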

Peiju Liu, Jinming Liu, Xipeng Qiu, Xuanjing Huang • 2026

Related benchmarks

Task	Dataset	Result
Robotic Manipulation	LIBERO	--	527
Robot Manipulation	SimplerEnv Google Robot tasks Variant Aggregation	Average Success Rate: 39.44	88
Robotic Manipulation	SIMPLER Google Robot VA	Pick Up Coke Can Success Rate: 89.7	35
Robotic Manipulation	SIMPLER Visual Matching	Average Success Rate: 78.1	31
