
Few-Shot Transformation of Common Actions into Time and Space

About

This paper introduces the task of few-shot common action localization in time and space. Given a few trimmed support videos containing the same but unknown action, we strive for spatio-temporal localization of that action in a long untrimmed query video. We do not require any class labels, interval bounds, or bounding boxes. To address this challenging task, we introduce a novel few-shot transformer architecture with a dedicated encoder-decoder structure optimized for joint commonality learning and localization prediction, without the need for proposals. Experiments on our reorganizations of the AVA and UCF101-24 datasets show the effectiveness of our approach for few-shot common action localization, even when the support videos are noisy. Although our approach is not specifically designed for common localization in time only, it also compares favorably against the few-shot and one-shot state-of-the-art in this setting. Lastly, we demonstrate that the few-shot transformer is easily extended to common action localization per pixel.
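The core idea of the abstract — frames of the untrimmed query video attending to features from the few trimmed support videos to learn what they have in common, then predicting a localization per frame without proposals — can be sketched as follows. This is a minimal, hypothetical illustration with made-up dimensions and randomly initialized heads, not the authors' actual model.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, support, d):
    """Each query frame attends to support features (commonality learning).

    query:   (T_q, d) per-frame features of the untrimmed query video
    support: (T_s, d) features pooled from the few trimmed support videos
    """
    scores = query @ support.T / np.sqrt(d)       # (T_q, T_s) similarities
    return softmax(scores, axis=-1) @ support     # (T_q, d) attended features

rng = np.random.default_rng(0)
d = 64                                            # hypothetical feature size
query_feats = rng.standard_normal((100, d))       # 100 query frames (toy data)
support_feats = rng.standard_normal((12, d))      # 12 support snippets (toy data)

attended = cross_attention(query_feats, support_feats, d)

# Proposal-free per-frame heads (randomly initialized, for illustration only):
# an "actionness" score and a spatial box per frame of the query video.
w_score = rng.standard_normal((d, 1))
w_box = rng.standard_normal((d, 4))
actionness = attended @ w_score                   # (100, 1)
boxes = attended @ w_box                          # (100, 4)
```

In the paper's actual architecture this attention sits inside a transformer encoder-decoder with learned heads; the sketch only conveys how support features condition per-frame query predictions.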

Pengwan Yang, Pascal Mettes, Cees G. M. Snoek • 2021

Related benchmarks

| Task | Dataset | Result | Rank |
|---|---|---|---|
| Common action localization | ActivityNet Common-Instance 1.3 | Video mAP 61.9 | 9 |
| Common action localization | ActivityNet Common-Multi-instance 1.3 | Video mAP 52.3 | 9 |
| Spatio-temporal common action localization | Common-AVA reorganized (test) | Frame mAP 28.1 | 8 |
| Spatio-temporal common action localization | Common-UCF reorganized UCF101-24 (test) | Frame mAP 66.7 | 8 |
| Action Semantic Segmentation | A2D (test) | mIoU 0.525 | 7 |
