On the Importance of Spatial Relations for Few-shot Action Recognition

About

Deep learning has achieved great success in video recognition, yet it still struggles to recognize novel actions from only a few examples. To tackle this challenge, few-shot action recognition methods have been proposed to transfer knowledge from a source dataset to a novel target dataset with only one or a few labeled videos. However, existing methods mainly focus on modeling the temporal relations between the query and support videos while ignoring the spatial relations. In this paper, we find that spatial misalignment between objects also occurs in videos and is notably more common than temporal inconsistency. We are thus motivated to investigate the importance of spatial relations and propose a more accurate few-shot action recognition method that leverages both spatial and temporal information. In particular, we contribute a novel Spatial Alignment Cross Transformer (SA-CT), which learns to re-adjust the spatial relations and incorporates the temporal information. Experiments reveal that, even without using any temporal information, the performance of SA-CT is comparable to that of temporal-based methods on 3 of 4 benchmarks. To further incorporate the temporal information, we propose a simple yet effective Temporal Mixer module. The Temporal Mixer enhances the video representation and improves the performance of the full SA-CT model, achieving very competitive results. In this work, we also exploit large-scale pretrained models for few-shot action recognition, providing useful insights for this research direction.

Yilun Zhang, Yuqian Fu, Xingjun Ma, Lizhe Qi, Jingjing Chen, Zuxuan Wu, Yu-Gang Jiang • 2023

Related benchmarks

Task                         Dataset                 Result                         Rank
Action Recognition           Kinetics                Accuracy (5-shot): 91.2        98
Action Recognition           SS Full v2              --                             58
Action Recognition           UCF101                  5-shot Accuracy: 98            48
Action Recognition           Something-Something v2  Accuracy (5-shot): 69.1        31
Few-shot Action Recognition  UCF101 5-shot           Accuracy: 98                   27
Few-shot Action Recognition  Kinetics 5-shot         Accuracy: 91.2                 27
Few-shot Action Recognition  HMDB51 5-shot           Accuracy: 81.6                 27
Action Recognition           HMDB51                  5-shot Accuracy: 81.6          25
Few-shot Action Recognition  SS 5-shot v2            Accuracy (SS 5-shot v2): 69.1  25
Few-shot Action Recognition  UCF101 1-shot           Accuracy: 85.4                 23
Showing 10 of 13 rows
