Temporal-Relational CrossTransformers for Few-Shot Action Recognition

About

We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
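The mechanism described above can be illustrated with a small sketch. This is not the authors' implementation: it is a simplified, dependency-free illustration that assumes per-frame features are already extracted, uses plain scaled dot-product attention, and omits the learned query/key/value projections a trained model would use. All function names (`ordered_tuples`, `tuple_features`, `query_specific_prototype`, `class_distance`) and the toy data are hypothetical.

```python
import math
from itertools import combinations


def ordered_tuples(num_frames, cardinality):
    """Index tuples (i1 < i2 < ...) of the given size, in temporal order."""
    return list(combinations(range(num_frames), cardinality))


def tuple_features(frames, cardinality):
    """Concatenate per-frame feature vectors for every ordered frame tuple."""
    feats = []
    for idx in ordered_tuples(len(frames), cardinality):
        vec = []
        for i in idx:
            vec.extend(frames[i])
        feats.append(vec)
    return feats


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def query_specific_prototype(query_vec, support_vecs):
    """Attention-weighted average of support tuples for one query tuple.

    Assumption: raw features serve as queries, keys and values; a real
    model would apply learned linear maps first.
    """
    scale = math.sqrt(len(query_vec))
    scores = [sum(q * s for q, s in zip(query_vec, sv)) / scale
              for sv in support_vecs]
    weights = softmax(scores)
    return [sum(w * sv[d] for w, sv in zip(weights, support_vecs))
            for d in range(len(query_vec))]


def class_distance(query_frames, support_videos, cardinality=2):
    """Mean squared distance between query tuples and their prototypes.

    Tuples from all support videos of the class are pooled, so each query
    tuple can attend to sub-sequences of any support video.
    """
    q_tuples = tuple_features(query_frames, cardinality)
    s_tuples = [t for video in support_videos
                for t in tuple_features(video, cardinality)]
    total = 0.0
    for q in q_tuples:
        proto = query_specific_prototype(q, s_tuples)
        total += sum((a - b) ** 2 for a, b in zip(q, proto))
    return total / len(q_tuples)


# Toy data: 3 frames per video, 2-dimensional frame features (made up).
class_a = [[[5, 0], [5, 1], [5, 2]], [[5, 1], [5, 2], [5, 0]]]
class_b = [[[0, 5], [1, 5], [2, 5]], [[2, 5], [0, 5], [1, 5]]]
query = [[5, 0], [5, 1], [5, 2]]

distances = {"A": class_distance(query, class_a),
             "B": class_distance(query, class_b)}
predicted = min(distances, key=distances.get)
print(predicted, distances)
```

The query is assigned to the class with the smallest distance. To mirror the "varying numbers of frames" in the text, one could sum `class_distance` over several cardinalities (for example pairs and triples); that combination rule is an assumption of this sketch.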

Toby Perrett, Alessandro Masullo, Tilo Burghardt, Majid Mirmehdi, Dima Damen • 2021

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Something-Something v2 | Top-1 Accuracy | 64.6 | 341 |
| Action Recognition | Kinetics | Accuracy (5-shot) | 85.9 | 47 |
| Few-shot Action Recognition | Kinetics (meta-test) | Accuracy | 85.9 | 46 |
| Action Recognition | SSv2 Few-shot | Top-1 Acc (5-way 1-shot) | 45.1 | 42 |
| Few-shot Action Recognition | SS Full meta v2 (test) | Accuracy | 64.6 | 38 |
| Video Action Recognition | UCF101 5-way 5-shot | Accuracy | 97.1 | 28 |
| Video Action Recognition | HMDB51 5-way 5-shot | Accuracy | 79.7 | 28 |
| Action Recognition | SSv2 Small | Top-1 Acc (1-shot) | 36 | 26 |
| Few-shot Video Classification | Something-Something V2 (Small) | Accuracy | 59.1 | 24 |
| Video Action Recognition | Kinetics | Accuracy | 85.9 | 23 |
Showing 10 of 31 rows
