Temporal-Relational CrossTransformers for Few-Shot Action Recognition
About
We propose a novel approach to few-shot action recognition, finding temporally-corresponding frame tuples between the query and videos in the support set. Distinct from previous few-shot works, we construct class prototypes using the CrossTransformer attention mechanism to observe relevant sub-sequences of all support videos, rather than using class averages or single best matches. Video representations are formed from ordered tuples of varying numbers of frames, which allows sub-sequences of actions at different speeds and temporal offsets to be compared. Our proposed Temporal-Relational CrossTransformers (TRX) achieve state-of-the-art results on few-shot splits of Kinetics, Something-Something V2 (SSv2), HMDB51 and UCF101. Importantly, our method outperforms prior work on SSv2 by a wide margin (12%) due to its ability to model temporal relations. A detailed ablation showcases the importance of matching to multiple support set videos and learning higher-order relational CrossTransformers.
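The idea of matching ordered frame tuples against all support videos of a class can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes precomputed per-frame features, uses a single tuple size, and replaces the learned CrossTransformer projections with plain scaled dot-product attention. All function names (`ordered_tuples`, `tuple_features`, `class_distance`) are hypothetical.

```python
import itertools
import numpy as np


def ordered_tuples(num_frames, cardinality):
    """Return all strictly increasing frame-index tuples of the given size."""
    return list(itertools.combinations(range(num_frames), cardinality))


def tuple_features(frames, cardinality):
    """Concatenate per-frame features for every ordered tuple.

    frames: array of shape (num_frames, dim)
    returns: array of shape (num_tuples, cardinality * dim)
    """
    index_tuples = ordered_tuples(frames.shape[0], cardinality)
    return np.stack([frames[list(t)].reshape(-1) for t in index_tuples])


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def class_distance(query, support_videos, cardinality=2):
    """Distance from a query video to a query-specific class prototype.

    Each query tuple attends over the tuples of ALL support videos of the
    class, so the prototype is neither a class average nor a single best
    match. Assumption: no learned query/key/value projections are used here.
    """
    q = tuple_features(query, cardinality)                      # (Tq, D)
    s = np.concatenate(
        [tuple_features(v, cardinality) for v in support_videos]
    )                                                           # (K * Ts, D)
    attention = softmax(q @ s.T / np.sqrt(q.shape[1]), axis=1)  # (Tq, K * Ts)
    prototype = attention @ s                                   # (Tq, D)
    # Mean squared Euclidean distance between query tuples and their prototypes.
    return float(np.mean(np.sum((q - prototype) ** 2, axis=1)))


# Toy usage: 4 frames with 3-dimensional features per video.
query = np.ones((4, 3))
class_a = [np.ones((4, 3)), np.ones((4, 3))]    # identical to the query
class_b = [np.zeros((4, 3)), np.zeros((4, 3))]  # different from the query
dist_a = class_distance(query, class_a)
dist_b = class_distance(query, class_b)
predicted = "a" if dist_a < dist_b else "b"
```

In this toy setup the query is assigned to the class with the smaller distance. Running the same routine with several tuple sizes and combining the resulting distances would correspond to the "varying numbers of frames" described above; how those are combined is left out of this sketch.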
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Action Recognition | Something-Something v2 | Top-1 Accuracy | 64.6 | 341 |
| Action Recognition | Kinetics | Accuracy (5-shot) | 85.9 | 47 |
| Few-shot Action Recognition | Kinetics (meta-test) | Accuracy | 85.9 | 46 |
| Action Recognition | SSv2 Few-shot | Top-1 Acc (5-way 1-shot) | 45.1 | 42 |
| Few-shot Action Recognition | SS Full meta v2 (test) | Accuracy | 64.6 | 38 |
| Video Action Recognition | UCF101 5-way 5-shot | Accuracy | 97.1 | 28 |
| Video Action Recognition | HMDB51 5-way 5-shot | Accuracy | 79.7 | 28 |
| Action Recognition | SSv2 Small | Top-1 Acc (1-shot) | 36 | 26 |
| Few-shot Video Classification | Something-Something V2 (Small) | Accuracy | 59.1 | 24 |
| Video Action Recognition | Kinetics | Accuracy | 85.9 | 23 |