Depth Guided Adaptive Meta-Fusion Network for Few-shot Video Recognition
About
Humans can easily recognize actions from only a few examples, while existing video recognition models still rely heavily on large-scale labeled data. This observation has motivated growing interest in few-shot video action recognition, which aims to learn new actions from only a handful of labeled samples. In this paper, we propose a depth guided Adaptive Meta-Fusion Network for few-shot video recognition, termed AMeFu-Net. Concretely, we tackle the few-shot recognition problem from three aspects:

1. we alleviate the extreme data scarcity by introducing depth information as a carrier of the scene, which brings extra visual cues to our model;
2. we fuse the representation of each original RGB clip with multiple non-strictly corresponding depth clips sampled by our temporal asynchronization augmentation mechanism, which synthesizes new instances at the feature level;
3. we propose a novel Depth Guided Adaptive Instance Normalization (DGAdaIN) fusion module to fuse the two modality streams efficiently.

Additionally, to better mimic the few-shot recognition process, our model is trained in a meta-learning fashion. Extensive experiments on several action recognition benchmarks demonstrate the effectiveness of our model.
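The temporal asynchronization augmentation can be sketched as sampling a depth clip whose start index is randomly offset from the paired RGB clip, so one RGB clip yields several slightly misaligned depth partners. This is a minimal illustration; the function name and parameters are assumptions, not the paper's API.

```python
import random

def sample_async_clips(num_frames, clip_len, max_shift, rng=random):
    """Sample a non-strictly aligned (RGB, depth) clip index pair.

    Toy sketch of temporal asynchronization augmentation: the depth clip
    starts at a random temporal offset from the RGB clip. Names and the
    sampling details are illustrative assumptions.
    """
    rgb_start = rng.randint(0, num_frames - clip_len)
    shift = rng.randint(-max_shift, max_shift)
    # Clamp so the shifted depth clip stays inside the video.
    depth_start = min(max(rgb_start + shift, 0), num_frames - clip_len)
    rgb_idx = list(range(rgb_start, rgb_start + clip_len))
    depth_idx = list(range(depth_start, depth_start + clip_len))
    return rgb_idx, depth_idx
```

Sampling the same RGB clip repeatedly with different shifts produces multiple feature-level "new instances" once the two clips are fused.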
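The DGAdaIN fusion idea follows the adaptive instance normalization pattern: instance-normalize the RGB feature, then modulate it with scale and shift parameters derived from the depth feature. The sketch below uses plain Python lists and stands in identity mappings for the learned scale/shift generators, so it illustrates the dataflow only, not the paper's actual module.

```python
import math

def instance_norm(feat, eps=1e-5):
    # Per-instance normalization of a 1-D feature vector
    # to zero mean and (approximately) unit variance.
    mean = sum(feat) / len(feat)
    var = sum((v - mean) ** 2 for v in feat) / len(feat)
    return [(v - mean) / math.sqrt(var + eps) for v in feat]

def dgadain(rgb_feat, depth_feat):
    """Toy DGAdaIN-style fusion: the depth feature supplies the affine
    (scale, shift) parameters that modulate the instance-normalized RGB
    feature. The paper learns these mappings; simple stand-ins are used
    here for illustration."""
    gamma = depth_feat                 # stand-in for a learned scale generator
    beta = instance_norm(depth_feat)   # stand-in for a learned shift generator
    normed = instance_norm(rgb_feat)
    return [g * n + b for g, n, b in zip(gamma, normed, beta)]
```

Because the affine parameters come from the depth stream, the scene information carried by depth directly conditions the RGB representation at fusion time.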
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Video Recognition | HMDB51 | Accuracy | 75.5% | 89 |
| Video Recognition | Kinetics (test) | Accuracy | 86.8% | 42 |
| Video Action Recognition | UCF101 5-way 5-shot | Accuracy | 95.5% | 28 |
| Video Action Recognition | HMDB51 5-way 5-shot | Accuracy | 75.5% | 28 |
| Few-shot Action Recognition | HMDB | Accuracy | 60.2% | 21 |
| Few-shot Action Recognition | UCF101 5-way 1-shot | Accuracy | 85.1% | 21 |
| Action Recognition | Kinetics standard (meta-test) | Accuracy | 86.8% | 17 |
| Video Classification | UCF-101 | Accuracy | 95.5% | 15 |
| Activity Recognition | Kinetics 5-shot 5-way (meta-test) | Accuracy | 86.8% | 6 |