
Asynchronous Temporal Fields for Action Recognition

About

Actions are more than just movements and trajectories: we cook to eat and we hold a cup to drink from it. A thorough understanding of videos requires going beyond appearance modeling and necessitates reasoning about the sequence of activities, as well as higher-level constructs such as intentions. But how do we model and reason about these? We propose a fully-connected temporal CRF model for reasoning over various aspects of activities, including objects, actions, and intentions, where the potentials are predicted by a deep network. End-to-end training of such structured models is a challenging endeavor: for inference and learning, we need to construct mini-batches consisting of whole videos, leading to mini-batches with only a few videos. This causes high correlation between data points, leading to a breakdown of the backprop algorithm. To address this challenge, we present an asynchronous variational inference method that allows efficient end-to-end training. Our method achieves a classification mAP of 22.4% on the Charades benchmark, outperforming the state-of-the-art (17.2% mAP), and offers equal gains on the task of temporal localization.
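The core inference step in a fully-connected temporal CRF can be illustrated with mean-field updates: each frame's belief is refined by messages aggregated from every other frame. The sketch below is a minimal illustration of that idea, not the paper's implementation; the unary potentials (standing in for the deep network's per-frame predictions), the single shared pairwise matrix, and the fixed iteration count are all simplifying assumptions.

```python
import numpy as np

def _softmax(x):
    # numerically stable softmax over the last axis
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mean_field_temporal_crf(unary, pairwise, n_iters=10):
    """Approximate per-frame marginals for a fully-connected temporal CRF.

    unary:    (T, K) log-potentials per frame (here: stand-ins for the
              deep network's predictions -- an assumption of this sketch)
    pairwise: (K, K) log-potential shared across all frame pairs (assumption)
    Returns Q of shape (T, K), one distribution over K labels per frame.
    """
    Q = _softmax(unary)  # initialize beliefs from the unaries
    for _ in range(n_iters):
        # In a fully-connected model, the message to frame t sums the
        # current beliefs of all other frames through the pairwise term.
        total = Q.sum(axis=0, keepdims=True)   # (1, K): sum over all frames
        msg = (total - Q) @ pairwise.T          # subtract Q to exclude self
        Q = _softmax(unary + msg)               # re-normalize beliefs
    return Q
```

Because every frame talks to every other frame, one update costs O(T·K²) rather than O(T²·K²), which is what makes fully-connected temporal models tractable for long videos.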

Gunnar A. Sigurdsson, Santosh Divvala, Ali Farhadi, Abhinav Gupta • 2016

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Recognition | Charades (val) | mAP | 22.4 | 69 |
| Action Recognition | Charades | mAP | 0.224 | 64 |
| Action Recognition | Charades (test) | mAP | 0.224 | 53 |
| Activity Detection | Charades localize v1 | mAP | 12.8 | 52 |
| Action Recognition | Charades v1 (test) | mAP | 22.4 | 52 |
| Video Classification | Charades | mAP | 22.4 | 38 |
| Activity Detection | Charades (test) | mAP | 9.6 | 19 |
| Multi-label Temporal Action Localization | Charades per-frame 51 | mAP | 12.8 | 14 |
| Temporal Localization | Charades | mAP | 12.8 | 12 |
| Multi-label video classification | Charades (val) | mAP | 22.4 | 12 |

Showing 10 of 14 rows.

Other info

Code
