Temporal Aggregate Representations for Long-Range Video Understanding

About

Future prediction, especially in long-range videos, requires reasoning from current and past observations. In this work, we address questions of temporal extent, scaling, and level of semantic abstraction with a flexible multi-granular temporal aggregation framework. We show that it is possible to achieve state of the art in both next action and dense anticipation with simple techniques such as max-pooling and attention. To demonstrate the anticipation capabilities of our model, we conduct experiments on Breakfast, 50Salads, and EPIC-Kitchens datasets, where we achieve state-of-the-art results. With minimal modifications, our model can also be extended for video segmentation and action recognition.
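As a rough illustration of what "multi-granular temporal aggregation" with max-pooling and attention could look like, the sketch below pools a sequence of per-frame features into several snippet counts and then attends over each pooled set with a summary of the most recent frames. It is not the authors' implementation: the function names, the granularities (2, 3, 5), the feature size, and the scaled dot-product form of attention are all assumptions chosen for the example.

```python
import numpy as np

def max_pool_snippets(features, num_snippets):
    """Split a (T, D) feature sequence into contiguous chunks and max-pool each.

    Assumes T >= num_snippets so that no chunk is empty.
    """
    chunks = np.array_split(features, num_snippets, axis=0)
    return np.stack([chunk.max(axis=0) for chunk in chunks])

def multi_granular_pool(features, granularities=(2, 3, 5)):
    """Pool the same sequence at several temporal granularities (hypothetical values)."""
    return {k: max_pool_snippets(features, k) for k in granularities}

def attend(query, keys):
    """Scaled dot-product attention of one (D,) query over (K, D) pooled snippets."""
    scores = keys @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ keys

# Toy data: 60 frames with 8-dimensional features (random stand-ins for real features).
rng = np.random.default_rng(0)
frames = rng.normal(size=(60, 8))

pooled = multi_granular_pool(frames)   # long-range context at three scales
recent = frames[-10:].max(axis=0)      # max-pooled summary of the latest frames
summary = np.concatenate([attend(recent, pooled[k]) for k in sorted(pooled)])
```

Here `summary` concatenates one attended vector per granularity; a real model would typically feed such a vector to a learned classifier, which this sketch leaves out.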

Fadime Sener, Dipika Singhania, Angela Yao • 2020

Related benchmarks

| Task | Dataset | Metric | Result | Rank |
| --- | --- | --- | --- | --- |
| Action Segmentation | Breakfast | Acc | 74 | 116 |
| Temporal action segmentation | Breakfast | Accuracy | 75.9 | 102 |
| Action Anticipation | EPIC-KITCHENS 100 (test) | Overall Action Top-5 Recall | 14.7 | 70 |
| Action Anticipation | Breakfast | MoC Accuracy | 30.4 | 64 |
| Long-term Action Anticipation | 50 Salads | MoC Accuracy | 30.6 | 56 |
| Action Anticipation | Epic-Kitchen 55 (val) | Top-1 Acc | 15.1 | 48 |
| Action Anticipation | Epic-Kitchens-100 (val) | mCR@5 (Overall Verb) | 27.8 | 48 |
| Action Anticipation | EPIC-KITCHENS unseen S2 (test) | Top-1 Acc (Verb) | 29.5 | 47 |
| Action Anticipation | EPIC (val) | Top-5 Action Accuracy | 40.2 | 28 |
| Dense anticipation mean over classes | Breakfast (test) | Mean Error @ 10% Horizon | 15.6 | 28 |
Showing 10 of 26 rows
