CAST: Cross-Attention in Space and Time for Video Action Recognition
About
Recognizing human actions in videos requires spatial and temporal understanding. Most existing action recognition models lack a balanced spatio-temporal understanding of videos. In this work, we propose a novel two-stream architecture, called Cross-Attention in Space and Time (CAST), that achieves a balanced spatio-temporal understanding of videos using only RGB input. Our proposed bottleneck cross-attention mechanism enables the spatial and temporal expert models to exchange information and make synergistic predictions, leading to improved performance. We validate the proposed method with extensive experiments on public benchmarks with different characteristics: EPIC-KITCHENS-100, Something-Something-V2, and Kinetics-400. Our method consistently shows favorable performance across these datasets, while the performance of existing methods fluctuates depending on the dataset characteristics.
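To make the idea concrete, below is a minimal PyTorch sketch of a bottleneck cross-attention exchange between a spatial and a temporal expert. The class and parameter names (`BottleneckCrossAttention`, `bottleneck_dim`, token shapes) are illustrative assumptions for this sketch, not the authors' implementation; the point is the pattern: each expert's tokens attend to the other expert's tokens through a low-dimensional bottleneck, with a residual connection back into the expert's stream.

```python
# Sketch of bottleneck cross-attention between two expert streams.
# All names and dimensions here are assumptions, not the CAST codebase.
import torch
import torch.nn as nn


class BottleneckCrossAttention(nn.Module):
    """One direction of expert-to-expert information exchange:
    tokens from the receiving expert attend to tokens from the other
    expert, with the exchange squeezed through a low-dim bottleneck."""

    def __init__(self, dim: int, bottleneck_dim: int, num_heads: int = 4):
        super().__init__()
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)
        self.down = nn.Linear(dim, bottleneck_dim)   # project into bottleneck
        self.attn = nn.MultiheadAttention(bottleneck_dim, num_heads,
                                          batch_first=True)
        self.up = nn.Linear(bottleneck_dim, dim)     # project back to expert dim

    def forward(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
        # x_q:  tokens of the expert receiving information, (B, N_q, dim)
        # x_kv: tokens of the expert providing information, (B, N_kv, dim)
        q = self.down(self.norm_q(x_q))
        kv = self.down(self.norm_kv(x_kv))
        out, _ = self.attn(q, kv, kv)                # cross-attention in bottleneck
        return x_q + self.up(out)                    # residual back into the stream


# Bidirectional exchange between a spatial and a temporal expert.
spatial_to_temporal = BottleneckCrossAttention(dim=768, bottleneck_dim=128)
temporal_to_spatial = BottleneckCrossAttention(dim=768, bottleneck_dim=128)

spatial_tokens = torch.randn(2, 196, 768)    # e.g. per-frame patch tokens
temporal_tokens = torch.randn(2, 16, 768)    # e.g. per-clip temporal tokens

temporal_tokens = spatial_to_temporal(temporal_tokens, spatial_tokens)
spatial_tokens = temporal_to_spatial(spatial_tokens, temporal_tokens)
```

In practice such an exchange block would be inserted at multiple depths of the two transformer streams so the experts can repeatedly inform each other before the final, synergistic prediction.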
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Accuracy | 85.3 | 413 |
| Action Recognition | Something-Something V2 | Top-1 Accuracy | 71.6 | 341 |
| Action Recognition | Something-Something V2 (test val) | Top-1 Accuracy | 71.6 | 187 |
| Action Recognition | EPIC-KITCHENS-100 (test) | Top-1 Verb Accuracy | 72.5 | 101 |
| Video Action Recognition | Kinetics-400 (test) | Top-1 Accuracy | 85.3 | 44 |
| Action Recognition | EK100 | Top-1 Verb Accuracy | 72.5 | 24 |
| Action Recognition | EK100, SSV2, and K400 | Overall Harmonic Mean | 71.6 | 18 |
| Action Recognition | SSV2 & K400 | Harmonic Mean | 77.9 | 14 |