VA-AR: Learning Velocity-Aware Action Representations with Mixture of Window Attention

About

Action recognition is a crucial task in artificial intelligence, with significant implications across various domains. We first perform a comprehensive analysis of seven prominent action recognition methods across five widely used datasets. This analysis reveals a critical, yet previously overlooked, observation: as the velocity of actions increases, the performance of these methods declines to varying degrees, undermining their robustness. This decline poses significant challenges for their application in real-world scenarios. Building on these findings, we introduce the Velocity-Aware Action Recognition (VA-AR) framework to obtain robust action representations across different velocities. Our principal insight is that rapid actions (e.g., a giant circle backward on the uneven bars or a smash in badminton) occur within short time intervals, necessitating smaller temporal attention windows to accurately capture intricate changes. Conversely, slower actions (e.g., drinking water or wiping one's face) require larger windows to effectively encompass the broader context. VA-AR employs a Mixture of Window Attention (MoWA) strategy, dynamically adjusting its attention window size based on the action's velocity. This adjustment enables VA-AR to obtain a velocity-aware representation, thereby enhancing the accuracy of action recognition. Extensive experiments confirm that VA-AR achieves state-of-the-art performance on the same five datasets, demonstrating its effectiveness across a broad spectrum of action recognition scenarios.
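
The idea of mixing temporal attention windows by velocity can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names, the window sizes (3, 9, 27), the velocity estimate (mean frame-to-frame displacement) and the gating rule are all assumptions chosen for illustration. The abstract does not specify how MoWA computes or combines its windows.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def window_attention(x, window):
    """Temporal self-attention where frame t only attends to frames
    within window // 2 steps of t. x has shape (T, D)."""
    T, D = x.shape
    scores = x @ x.T / np.sqrt(D)
    idx = np.arange(T)
    inside = np.abs(idx[:, None] - idx[None, :]) <= window // 2
    scores = np.where(inside, scores, -np.inf)  # mask frames outside the window
    return softmax(scores, axis=-1) @ x

def mean_velocity(x):
    """Assumed velocity proxy: mean frame-to-frame displacement magnitude."""
    return float(np.mean(np.linalg.norm(np.diff(x, axis=0), axis=1)))

def mixture_of_window_attention(x, windows=(3, 9, 27), temperature=1.0):
    """Blend several window attentions with velocity-dependent weights.
    Hypothetical gate: higher velocity shifts weight to smaller windows."""
    v = mean_velocity(x)
    logits = -v * np.log(np.asarray(windows, dtype=float)) / temperature
    gate = softmax(logits)
    out = sum(g * window_attention(x, w) for g, w in zip(gate, windows))
    return out, gate

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    steps = rng.standard_normal((32, 8))
    slow = np.cumsum(0.01 * steps, axis=0)  # small per-frame motion
    fast = np.cumsum(1.0 * steps, axis=0)   # large per-frame motion
    for name, seq in (("slow", slow), ("fast", fast)):
        _, gate = mixture_of_window_attention(seq)
        print(name, np.round(gate, 3))
```

In this sketch a near-static sequence yields almost uniform weights over the three windows, while a fast sequence concentrates weight on the smallest window. A learned gate would normally replace the hand-written rule used here.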

Jiangning Wei, Lixiong Qin, Bo Yu, Tianjian Zou, Chuhan Yan, Dandan Xiao, Yang Yu, Lan Yang, Ke Li, Jun Liu • 2025

Related benchmarks

Task	Dataset	Result
Action Recognition	NTU RGB+D 60 (Cross-View)	Accuracy 97.2	601
Action Recognition	NTU RGB+D 60 (X-sub)	Accuracy 93.1	496
Action Recognition	NTU-60 (xsub)	Accuracy 93.1	271
Action Recognition	NTU-120 (cross-subject (xsub))	Accuracy 90.3	239
Action Recognition	NTU 120 (Cross-Setup)	Accuracy 91.5	231
Skeleton-based Action Recognition	NTU RGB+D 120 (X-set)	Top-1 Accuracy 91.5	184
Action Recognition	NTU-60 (xview)	Accuracy 97.2	165
Skeleton-based Action Recognition	NTU-RGB+D 120 (X-Sub)	Accuracy 90.3	79
Action Recognition	FineGYM	Accuracy 92.8	29
