Few-shot Action Recognition with Permutation-invariant Attention
About
Many few-shot learning models focus on recognising images. In contrast, we tackle the challenging task of few-shot action recognition from videos. We build on a C3D encoder for spatio-temporal video blocks to capture short-range action patterns. The encoded blocks are aggregated by permutation-invariant pooling, which makes our approach robust to varying action lengths and to long-range temporal dependencies whose patterns are unlikely to repeat even in clips of the same class. The pooled representations are then combined into simple relation descriptors that encode so-called query and support clips. Finally, the relation descriptors are fed to a comparator that learns the similarity between query and support clips. Importantly, to re-weight block contributions during pooling, we exploit spatial and temporal attention modules and self-supervision. In naturalistic clips of the same class there exists a temporal distribution shift: the locations of discriminative temporal action hotspots vary. Thus, we permute the blocks of a clip and align the resulting attention regions with the similarly permuted attention regions of the non-permuted clip, which trains the attention mechanism to be invariant to block (and thus long-term hotspot) permutations. Our method outperforms the state of the art on the HMDB51, UCF101, and miniMIT datasets.
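The pooling and attention-alignment ideas in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: the C3D encoder, the spatial and temporal attention modules, and the comparator are all omitted, and the linear attention score, the positional bias `pos_bias`, the squared-error alignment loss, and every name and size below are hypothetical choices made only for illustration. The sketch shows two things: an attention-weighted sum over blocks does not change when the blocks are reordered, and an attention function that depends on block position produces a non-zero gap between "attention of the permuted clip" and "permuted attention of the original clip", which is the kind of gap a self-supervised alignment term could penalise.

```python
import math
import random


def attention(blocks, w, pos_bias=None):
    """Toy attention (assumption): softmax over one linear score per block.

    pos_bias, if given, adds a score that depends on the block's position,
    which makes the attention sensitive to block order.
    """
    scores = [sum(x * wi for x, wi in zip(b, w)) for b in blocks]
    if pos_bias is not None:
        scores = [s + p for s, p in zip(scores, pos_bias)]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def pool(blocks, weights):
    """Weighted sum over blocks, one output value per feature dimension."""
    dim = len(blocks[0])
    return [sum(a * b[d] for a, b in zip(weights, blocks)) for d in range(dim)]


def alignment_loss(blocks, w, perm, pos_bias=None):
    """Squared gap (assumed loss form) between the attention computed on the
    permuted clip and the permuted attention of the original clip."""
    a_orig = attention(blocks, w, pos_bias)
    a_perm = attention([blocks[i] for i in perm], w, pos_bias)
    return sum((ap - a_orig[i]) ** 2 for ap, i in zip(a_perm, perm))


random.seed(0)
num_blocks, dim = 6, 8  # hypothetical sizes
blocks = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(num_blocks)]
w = [0.1 * random.gauss(0, 1) for _ in range(dim)]
perm = list(reversed(range(num_blocks)))  # reverse the block order
permuted = [blocks[i] for i in perm]

# Pooling with position-free attention gives the same descriptor either way.
pooled = pool(blocks, attention(blocks, w))
pooled_perm = pool(permuted, attention(permuted, w))
max_gap = max(abs(a - b) for a, b in zip(pooled, pooled_perm))

# Position-dependent attention breaks the alignment; position-free does not.
pos_bias = [0.4 * t for t in range(num_blocks)]
loss_free = alignment_loss(blocks, w, perm)
loss_pos = alignment_loss(blocks, w, perm, pos_bias)
print(max_gap, loss_free, loss_pos)
```

In this toy setting the position-free attention is aligned by construction, so its loss is zero up to rounding. An attention module that sees temporal context has no such guarantee, which is presumably why the abstract describes training for the invariance rather than assuming it.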
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Video Recognition | HMDB51 | Accuracy | 60.6 | 89 |
| Action Recognition | Kinetics | Accuracy (5-shot) | 82.4 | 47 |
| Few-shot Action Recognition | Kinetics (meta-test) | Accuracy | 82.4 | 46 |
| Video Recognition | Kinetics (test) | Accuracy | 82.4 | 42 |
| Video Action Recognition | HMDB51 5-way 5-shot | Accuracy | 60.6 | 28 |
| Video Action Recognition | UCF101 5-way 5-shot | Accuracy | 83.1 | 28 |
| Video Action Recognition | Kinetics | Accuracy | 82.4 | 23 |
| Few-shot Action Recognition | HMDB51 meta (test) | Accuracy | 60.6 | 21 |
| Few-shot Action Recognition | HMDB | Accuracy | 45.5 | 21 |
| Few-shot Action Recognition | UCF101 5-way 1-shot | Accuracy | 66.3 | 21 |