Few-shot Action Recognition with Captioning Foundation Models

About

Transferring vision-language knowledge from pretrained multimodal foundation models to various downstream tasks is a promising direction. However, most current few-shot action recognition methods are still limited to a single visual modality input due to the high cost of annotating additional textual descriptions. In this paper, we develop an effective plug-and-play framework called CapFSAR to exploit the knowledge of multimodal models without manually annotating text. To be specific, we first utilize a captioning foundation model (i.e., BLIP) to extract visual features and automatically generate associated captions for input videos. Then, we apply a text encoder to the synthetic captions to obtain representative text embeddings. Finally, a visual-text aggregation module based on Transformer is further designed to incorporate cross-modal spatio-temporal complementary information for reliable few-shot matching. In this way, CapFSAR can benefit from powerful multimodal knowledge of pretrained foundation models, yielding more comprehensive classification in the low-shot regime. Extensive experiments on multiple standard few-shot benchmarks demonstrate that the proposed CapFSAR performs favorably against existing methods and achieves state-of-the-art performance. The code will be made publicly available.
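The pipeline in the abstract (visual features, caption embeddings, cross-modal aggregation, few-shot matching) can be illustrated with a toy sketch. Everything here is an assumption made for illustration and is not the CapFSAR code: the feature vectors are hand-written stand-ins for BLIP visual features and caption embeddings, `fuse` replaces the Transformer-based aggregation module with plain normalize-and-concatenate, and `classify` uses nearest-prototype cosine matching, which the abstract does not specify.

```python
import math
from typing import Dict, List, Sequence

Vector = List[float]


def l2_normalize(v: Sequence[float]) -> Vector:
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)


def fuse(visual: Sequence[float], text: Sequence[float]) -> Vector:
    """Hypothetical stand-in for the visual-text aggregation module:
    normalize each modality, then concatenate the two vectors."""
    return l2_normalize(visual) + l2_normalize(text)


def mean(vectors: Sequence[Sequence[float]]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors."""
    a_unit, b_unit = l2_normalize(a), l2_normalize(b)
    return sum(x * y for x, y in zip(a_unit, b_unit))


def classify(query: Vector, support: Dict[str, List[Vector]]) -> str:
    """Assumed matching rule: pick the class whose mean support feature
    (prototype) is most similar to the query."""
    prototypes = {label: mean(feats) for label, feats in support.items()}
    return max(prototypes, key=lambda label: cosine(query, prototypes[label]))


# Toy 2-way 1-shot episode. Each sample is (visual feature, caption embedding);
# the numbers are invented placeholders, not real model outputs.
support = {
    "pouring": [fuse([1.0, 0.0], [0.9, 0.1])],
    "opening": [fuse([0.0, 1.0], [0.1, 0.9])],
}
query = fuse([0.8, 0.2], [0.7, 0.3])
predicted = classify(query, support)
print(predicted)
```

In this toy episode both modalities of the query point toward the "pouring" support sample, so the fused feature matches that class. With real features, the caption embedding is meant to supply complementary evidence when the visual feature alone is ambiguous.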

Xiang Wang, Shiwei Zhang, Hangjie Yuan, Yingya Zhang, Changxin Gao, Deli Zhao, Nong Sang • 2023

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| Action Recognition | SSv2 Few-shot | Top-1 Acc (5-way 1-shot) | 54 | 42 |
| Video Action Recognition | UCF101 5-way 5-shot | Accuracy | 97.8 | 28 |
| Video Action Recognition | HMDB51 5-way 5-shot | Accuracy | 78.6 | 28 |
| Few-shot Action Recognition | UCF101 5-way 1-shot | Accuracy | 93.3 | 21 |
| Few-shot Action Recognition | HMDB | Accuracy | 65.2 | 21 |
| 5-way few-shot action recognition | Kinetics (test) | 1-shot Accuracy | 84.9 | 19 |
| 5-way few-shot action recognition | SS small v2 (test) | 1-shot Accuracy | 45.9 | 13 |
| Few-shot Action Recognition | UCF | Accuracy (5-way 1-shot) | 93.1 | 9 |
