Temporal Aggregate Representations for Long-Range Video Understanding
About
Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
Fadime Sener, Dipika Singhania, Angela Yao • 2020
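The abstract's idea of aggregating features at several temporal granularities with max-pooling can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the choice of granularities, and the use of plain Python lists for per-frame features are all assumptions made for illustration, and the attention component mentioned in the abstract is omitted.

```python
from typing import Dict, List, Sequence

Frame = Sequence[float]  # one per-frame feature vector (hypothetical layout)


def max_pool_segments(frames: Sequence[Frame], num_segments: int) -> List[List[float]]:
    """Split frames into contiguous chunks and max-pool each chunk element-wise."""
    n = len(frames)
    if num_segments < 1 or num_segments > n:
        raise ValueError("num_segments must be in [1, len(frames)]")
    pooled = []
    for k in range(num_segments):
        # Integer boundaries that cover every frame exactly once.
        start = k * n // num_segments
        end = (k + 1) * n // num_segments
        chunk = frames[start:end]
        # Element-wise maximum over the frames in this chunk.
        pooled.append([max(dim) for dim in zip(*chunk)])
    return pooled


def multi_granular_pool(
    frames: Sequence[Frame], granularities: Sequence[int] = (1, 2, 4)
) -> Dict[int, List[List[float]]]:
    """Pool the same frame sequence at several granularities (assumed values)."""
    return {g: max_pool_segments(frames, g) for g in granularities}


if __name__ == "__main__":
    frames = [[1, 0], [0, 2], [3, 1], [2, 5]]
    for g, feats in multi_granular_pool(frames, (1, 2)).items():
        print(g, feats)
```

A coarse granularity (one segment) summarizes the whole observed window, while finer granularities keep some temporal ordering; a downstream model could then combine these pooled representations.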
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Action Segmentation | Breakfast | F1@10 | 78.8 | 107 |
| Temporal action segmentation | Breakfast | Accuracy | 75.9 | 96 |
| Action Anticipation | Breakfast | MoC Accuracy | 30.4 | 64 |
| Action Anticipation | EPIC-KITCHENS 100 (test) | Overall Action Top-5 Recall | 14.7 | 59 |
| Long-term Action Anticipation | 50 Salads | MoC Accuracy | 30.6 | 56 |
| Action Anticipation | EPIC-KITCHENS unseen S2 (test) | Top-1 Acc (Verb) | 29.5 | 47 |
| Action Anticipation | Epic-Kitchen 55 (val) | Top-1 Acc | 15.1 | 33 |
| Action Anticipation | Epic-Kitchens-100 (val) | mCR@5 (Overall Verb) | 27.8 | 33 |
| Action Anticipation | EPIC (val) | Top-5 Action Accuracy | 40.2 | 28 |
| Dense anticipation mean over classes | Breakfast (test) | Mean Error @ 10% Horizon | 15.6 | 28 |
Showing 10 of 26 rows