Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

About

Temporal reasoning is an important aspect of video analysis. 3D CNNs perform well by learning spatial-temporal features jointly in an unconstrained way, but they also substantially increase the computational cost. Previous works try to reduce the complexity by decoupling the spatial and temporal filters. In this paper, we propose a novel decomposition method that splits the feature channels into spatial and temporal groups that operate in parallel. This decomposition lets the two groups focus on static and dynamic cues separately. We call this grouped spatial-temporal aggregation (GST). This decomposition is more parameter-efficient and enables us to quantitatively analyze the contributions of spatial and temporal features in different layers. We verify our model on several action recognition tasks that require temporal reasoning and show its effectiveness.

Chenxu Luo, Alan Yuille • 2019
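
To make the parameter-efficiency claim in the abstract concrete, the following is a minimal sketch that counts convolution weights for a dense 3D convolution and for a channel-split layout with a spatial group and a temporal group in parallel. It is not the authors' implementation: the function names, the 50/50 channel split, and the kernel shapes (1x3x3 for the spatial group, 3x3x3 for the temporal group) are assumptions chosen for illustration.

```python
def conv3d_params(c_in, c_out, kt=3, kh=3, kw=3):
    """Number of weights in a dense 3D convolution (bias omitted)."""
    return c_in * c_out * kt * kh * kw


def grouped_st_params(c_in, c_out, temporal_fraction=0.5, k=3):
    """Number of weights when channels are split into two parallel groups.

    Assumption (hypothetical layout): the spatial group uses a 1 x k x k
    kernel, the temporal group uses a k x k x k kernel, and the two outputs
    are concatenated along the channel axis.
    """
    c_in_t = int(c_in * temporal_fraction)
    c_out_t = int(c_out * temporal_fraction)
    c_in_s = c_in - c_in_t
    c_out_s = c_out - c_out_t
    spatial = c_in_s * c_out_s * 1 * k * k
    temporal = conv3d_params(c_in_t, c_out_t, k, k, k)
    return spatial + temporal


dense = conv3d_params(64, 64)
grouped = grouped_st_params(64, 64, temporal_fraction=0.5)
print(dense, grouped)  # 110592 36864
```

Under these assumed settings the split layout needs one third of the weights of the dense 3D convolution, because each group only connects its own share of input channels to its own share of output channels. The split ratio used in the paper may differ.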

Related benchmarks

Task	Dataset	Result
Action Recognition	Something-Something v2 (val)	Top-1 Accuracy 62.6	565
Action Recognition	Something-Something v2	Top-1 Accuracy 62.6	363
Action Recognition	Something-Something v2 (test)	Top-1 Acc 62.6	333
Action Recognition	Something-something v1 (val)	Top-1 Acc 48.6	257
Action Recognition	Something-something v1 (test)	Top-1 Accuracy 48.6	189
Action Recognition	Something-Something v2 (test val)	Top-1 Accuracy 63.1	187
Video Classification	Something-Something v2 (test)	Top-1 Acc 0.626	169
Action Recognition	Something-Something V1	Top-1 Acc 48.6	162
Video Action Classification	Something-Something v2	Top-1 Acc 62.6	145
Video Classification	Something-something v1 (test)	Top-1 Accuracy 48.6	115

Showing 10 of 28 rows
