Grouped Spatial-Temporal Aggregation for Efficient Action Recognition

About

Temporal reasoning is an important aspect of video analysis. 3D CNNs perform well by learning spatial-temporal features jointly and without constraints, but at a substantially higher computational cost. Previous work reduces this complexity by decoupling the spatial and temporal filters. In this paper, we propose a novel decomposition that splits the feature channels into parallel spatial and temporal groups, so that the two groups can focus on static and dynamic cues, respectively. We call this grouped spatial-temporal aggregation (GST). The decomposition is more parameter-efficient and lets us quantitatively analyze the contributions of spatial and temporal features in different layers. We verify our model on several action recognition tasks that require temporal reasoning and show its effectiveness.
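
To make the parameter-efficiency claim concrete, the following is a minimal sketch that counts convolution weights for a full 3D convolution versus a channel-split layer in the spirit of GST. The details are assumptions for illustration and are not taken from the abstract: a 3x3x3 kernel for the temporal group, a 1x3x3 kernel for the spatial group, an even channel split (`temporal_fraction=0.5`), no bias terms, and no fusion layer after the two groups. The function names are hypothetical.

```python
def conv_params(c_in, c_out, kt, kh, kw):
    """Weight count of a convolution with a kt x kh x kw kernel (bias ignored)."""
    return c_in * c_out * kt * kh * kw


def full_3d_params(c_in, c_out, k=3):
    """Every output channel sees every input channel through a k x k x k kernel."""
    return conv_params(c_in, c_out, k, k, k)


def grouped_st_params(c_in, c_out, temporal_fraction=0.5, k=3):
    """Assumed channel split: one group uses a spatial-only kernel (1 x k x k),
    the other a spatial-temporal kernel (k x k x k). The groups run in parallel
    and do not share channels."""
    c_in_t = int(c_in * temporal_fraction)    # input channels in the temporal group
    c_out_t = int(c_out * temporal_fraction)  # output channels in the temporal group
    c_in_s = c_in - c_in_t                    # remaining channels form the spatial group
    c_out_s = c_out - c_out_t
    spatial = conv_params(c_in_s, c_out_s, 1, k, k)
    temporal = conv_params(c_in_t, c_out_t, k, k, k)
    return spatial + temporal


full = full_3d_params(64, 64)
grouped = grouped_st_params(64, 64)
print(full, grouped, grouped / full)
```

Under these assumptions, a layer with 64 input and 64 output channels needs 110,592 weights as a full 3D convolution and 36,864 weights when split evenly, which is one third as many. Changing `temporal_fraction` shifts the balance between the two groups and the resulting count.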

Chenxu Luo, Alan Yuille • 2019

Related benchmarks

Task                        | Dataset                             | Metric         | Result | Rank
----------------------------|-------------------------------------|----------------|--------|-----
Action Recognition          | Something-Something v2 (val)        | Top-1 Accuracy | 62.6   | 535
Action Recognition          | Something-Something v2              | Top-1 Accuracy | 62.6   | 341
Action Recognition          | Something-Something v2 (test)       | Top-1 Accuracy | 62.6   | 333
Action Recognition          | Something-something v1 (val)        | Top-1 Accuracy | 48.6   | 257
Action Recognition          | Something-something v1 (test)       | Top-1 Accuracy | 48.6   | 189
Action Recognition          | Something-Something v2 (test val)   | Top-1 Accuracy | 63.1   | 187
Video Classification        | Something-Something v2 (test)       | Top-1 Accuracy | 0.626  | 169
Action Recognition          | Something-Something V1              | Top-1 Accuracy | 48.6   | 162
Video Action Classification | Something-Something v2              | Top-1 Accuracy | 62.6   | 139
Video Classification        | Something-something v1 (test)       | Top-1 Accuracy | 48.6   | 115
Showing 10 of 27 rows
