Coarse-Fine Networks for Temporal Activity Detection in Videos

About

In this paper, we introduce Coarse-Fine Networks, a two-stream architecture which benefits from different abstractions of temporal resolution to learn better video representations for long-term motion. Traditional video models process inputs at one (or a few) fixed temporal resolutions without any dynamic frame selection. However, we argue that processing multiple temporal resolutions of the input, and doing so dynamically by learning to estimate the importance of each frame, can largely improve video representations, especially in the domain of temporal activity localization. To this end, we propose (1) Grid Pool, a learned temporal downsampling layer to extract coarse features, and (2) Multi-stage Fusion, a spatio-temporal attention mechanism to fuse a fine-grained context with the coarse features. We show that our method outperforms the state of the art for action detection on public datasets including Charades, with a significantly reduced compute and memory footprint. The code is available at https://github.com/kkahatapitiya/Coarse-Fine-Networks
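To make the Grid Pool idea concrete, here is a minimal NumPy sketch of learned, importance-driven temporal downsampling. It is not the paper's implementation (see the linked repository for that); the scoring weights `w`, the softmax-based importance, and the CDF-inversion sampling are all illustrative assumptions, chosen to show how a model can keep more frames where estimated importance is high.

```python
import numpy as np

def grid_pool(x, w, t_out):
    """Sketch of learned temporal downsampling ("Grid Pool"-style).

    x:     (T, C) per-frame features.
    w:     (C,) hypothetical learned weights that score frame importance.
    t_out: number of coarse frames to keep (t_out <= T).

    Returns (t_out, C) coarse features, sampled non-uniformly so that
    temporally "important" regions contribute more frames.
    """
    T, _ = x.shape
    scores = x @ w                       # per-frame importance logits
    p = np.exp(scores - scores.max())
    p /= p.sum()                         # softmax over the time axis
    cdf = np.cumsum(p)                   # cumulative importance in [0, 1]
    # Invert the CDF at t_out evenly spaced quantiles: high-importance
    # frames occupy more of the CDF, so they are sampled more densely.
    quantiles = (np.arange(t_out) + 0.5) / t_out
    idx = np.clip(np.searchsorted(cdf, quantiles), 0, T - 1)
    return x[idx]

# Example: downsample 16 frames of 4-dim features to 4 coarse frames.
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 4))
coarse = grid_pool(feats, w=np.ones(4), t_out=4)
print(coarse.shape)  # (4, 4)
```

In a trainable version, the frame scorer would be a small learned module and the sampling would be made differentiable (e.g. via interpolation), so the downsampling grid itself is optimized end-to-end.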

Kumara Kahatapitiya, Michael S. Ryoo • 2021

Related benchmarks

Task | Dataset | Metric | Result | Rank
Activity Detection | Charades localize v1 | mAP | 25.1 | 52
Action Detection | Charades (test) | PAC | 10.7 | 27
Activity Detection | Charades (test) | mAP | 25.1 | 19
Temporal Activity Detection | Charades v1_localize (val) | mAP | 25.1 | 15
Multi-label Temporal Action Segmentation | Charades 1.0 (test) | Seg-mAP | 25.1 | 14
Action Detection | Charades RGB (test) | mAP | 0.251 | 10
Action Detection | Charades | -- | -- | 10
Temporal Action Localization | Charades (test) | Average mAP | 6.1 | 9
Multi-label Temporal Action Detection | Charades 1.0 (test) | Det-mAP | 6.1 | 5
