VideoGraph: Recognizing Minutes-Long Human Activities in Videos

About

Many human activities take minutes to unfold. To represent them, some related works opt for statistical pooling, which neglects temporal structure. Others opt for convolutional methods, such as CNNs and Non-Local networks. While successful in learning temporal concepts, these methods fall short of modeling minutes-long temporal dependencies. We propose VideoGraph, a method that achieves the best of both worlds: it represents minutes-long human activities and learns their underlying temporal structure. VideoGraph learns a graph-based representation for human activities. The graph, its nodes, and its edges are learned entirely from video datasets, making VideoGraph applicable to problems without node-level annotation. The result is improvements over related works on two benchmarks: Epic-Kitchen and Breakfast. In addition, we demonstrate that VideoGraph is able to learn the temporal structure of human activities in minutes-long videos.
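The claim that statistical pooling neglects temporal structure can be illustrated with a toy example. The sketch below is not VideoGraph and uses made-up per-segment features: it only shows that mean pooling returns the same vector when the segments of a video are reordered, while a trivially order-aware representation (plain concatenation, chosen here purely as a stand-in) does not. The function names and the feature values are assumptions for illustration.

```python
# Illustrative sketch only: toy features, not VideoGraph's representation.

def mean_pool(segments):
    """Average per-segment feature vectors over time (statistical pooling)."""
    n = len(segments)
    dim = len(segments[0])
    return [sum(seg[d] for seg in segments) / n for d in range(dim)]

def ordered_concat(segments):
    """Order-aware stand-in: concatenate segment features in temporal order."""
    return [x for seg in segments for x in seg]

# Hypothetical video: 4 temporal segments, each a 3-dimensional feature vector.
video = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [1.0, 1.0, 0.0],
]

# The same segments in a different temporal order.
shuffled = [video[2], video[0], video[3], video[1]]

# Mean pooling cannot tell the two orderings apart.
pooled_same = mean_pool(video) == mean_pool(shuffled)

# The order-aware stand-in can.
concat_same = ordered_concat(video) == ordered_concat(shuffled)

print("mean pooling identical after shuffle:", pooled_same)
print("ordered concat identical after shuffle:", concat_same)
```

Any representation meant to capture the temporal structure of a minutes-long activity has to break this invariance in some way; the abstract states that VideoGraph does so with a learned graph rather than with concatenation.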

Noureldien Hussein, Efstratios Gavves, Arnold W.M. Smeulders • 2019

Related benchmarks

Task	Dataset	Metric	Result
Action Recognition	Something-Something V1	Top-1 Acc	41.6	162
Action Recognition	Charades (test)	mAP	0.378	53
Action Recognition	Breakfast	Top-1 Accuracy	69.5	28
Single-label activity classification	Breakfast	Accuracy	69.5	21
Video Action Recognition	Breakfast	Top-1 Accuracy	69.5	18
Human Activity Recognition	Breakfast	Accuracy	69.5	14
Long-form Video Classification	Breakfast	Top-1 Accuracy	69.5	14
Action Recognition	Breakfast (1357:335)	Accuracy	69.5	13
Video Understanding	Breakfast	Top-1 Acc	69.5	12
Long-term Action Anticipation	EK-55 (val)	mAP (ALL)	22.5	10

Showing 10 of 13 rows
