Temporal Aggregate Representations for Long-Range Video Understanding

About

Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
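The abstract names max-pooling as one of the simple techniques behind the multi-granular aggregation. As a rough illustration only, and not the authors' implementation, the sketch below max-pools per-frame feature vectors over contiguous snippets at several temporal granularities. The function names, the choice of scales, and the even-split scheme are all assumptions made for this example.

```python
from typing import Dict, List, Sequence

Vector = List[float]


def max_pool(frames: Sequence[Vector]) -> Vector:
    """Element-wise max over a non-empty sequence of equal-length vectors."""
    return [max(values) for values in zip(*frames)]


def multi_granular_pool(
    frames: Sequence[Vector], scales: Sequence[int]
) -> Dict[int, List[Vector]]:
    """Hypothetical helper: for each scale k, split the frames into k
    contiguous snippets and max-pool each snippet into one vector."""
    n = len(frames)
    pooled: Dict[int, List[Vector]] = {}
    for k in scales:
        if k < 1 or k > n:
            raise ValueError(f"scale {k} must be between 1 and {n}")
        # Snippet i covers frames[i*n//k : (i+1)*n//k]; since k <= n,
        # every snippet contains at least one frame.
        pooled[k] = [
            max_pool(frames[i * n // k:(i + 1) * n // k]) for i in range(k)
        ]
    return pooled


# Four frames with two-dimensional features (made-up numbers).
frames = [[1.0, 0.0], [2.0, 5.0], [0.0, 3.0], [4.0, 1.0]]
result = multi_granular_pool(frames, scales=[1, 2])
print(result[1])  # one snippet covering the whole sequence
print(result[2])  # two snippets, each covering half of the sequence
```

A coarse scale (k = 1) summarizes the whole observed sequence in one vector, while finer scales keep more temporal detail. How such pooled snippets are then combined with attention is described in the paper itself and is not reproduced here.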

Fadime Sener, Dipika Singhania, Angela Yao • 2020

Related benchmarks

Task                                  Dataset                         Metric                       Result  Rank
Action Segmentation                   Breakfast                       F1@10                        78.8    107
Temporal action segmentation          Breakfast                       Accuracy                     75.9    96
Action Anticipation                   Breakfast                       MoC Accuracy                 30.4    64
Action Anticipation                   EPIC-KITCHENS 100 (test)        Overall Action Top-5 Recall  14.7    59
Long-term Action Anticipation         50 Salads                       MoC Accuracy                 30.6    56
Action Anticipation                   EPIC-KITCHENS unseen S2 (test)  Top-1 Acc (Verb)             29.5    47
Action Anticipation                   Epic-Kitchen 55 (val)           Top-1 Acc                    15.1    33
Action Anticipation                   Epic-Kitchens-100 (val)         mCR@5 (Overall Verb)         27.8    33
Action Anticipation                   EPIC (val)                      Top-5 Action Accuracy        40.2    28
Dense anticipation mean over classes  Breakfast (test)                Mean Error @ 10% Horizon     15.6    28
Showing 10 of 26 rows
