
Videos as Space-Time Region Graphs

About

How do humans recognize the action "opening a book"? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this paper, we propose to represent videos as space-time region graphs, which capture these two important cues. Our graph nodes are defined by the object region proposals from different frames in a long-range video. These nodes are connected by two types of relations: (i) similarity relations capturing the long-range dependencies between correlated objects, and (ii) spatial-temporal relations capturing the interactions between nearby objects. We perform reasoning on this graph representation via Graph Convolutional Networks. We achieve state-of-the-art results on both the Charades and Something-Something datasets. Especially for Charades, we obtain a huge 4.4% gain when our model is applied in complex environments.
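To make the similarity-relation update concrete, here is a minimal NumPy sketch of one graph-convolution step over region-proposal features. This is an illustrative reconstruction, not the authors' code: the function name, dimensions, and the choice of dot-product affinity with a softmax-normalized adjacency are assumptions made for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along an axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def similarity_gcn_layer(node_feats, weight):
    """One graph-convolution step over a similarity graph.

    node_feats: (N, d) features of N object-region proposals
                drawn from different frames of a video.
    weight:     (d, d_out) learnable projection (hypothetical).

    The adjacency is built from pairwise feature affinities,
    row-normalized with a softmax, and features are aggregated
    as A @ X @ W followed by a ReLU.
    """
    affinity = node_feats @ node_feats.T      # (N, N) pairwise scores
    adjacency = softmax(affinity, axis=1)     # each node's edges sum to 1
    return np.maximum(adjacency @ node_feats @ weight, 0.0)

# toy usage: 5 region proposals with 8-dim features
rng = np.random.default_rng(0)
X = rng.standard_normal((5, 8))
W = rng.standard_normal((8, 8)) * 0.1
out = similarity_gcn_layer(X, W)
print(out.shape)  # (5, 8)
```

In the paper's framing, stacking such layers lets information propagate between correlated objects across distant frames; the spatial-temporal relations would use a separate adjacency built from box overlap between neighboring frames.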

Xiaolong Wang, Abhinav Gupta • 2018

Related benchmarks

Task                 | Dataset                        | Metric         | Result | Rank
Action Recognition   | Kinetics-400                   | Top-1 Acc      | 72.1   | 413
Action Recognition   | Something-something v1 (val)   | Top-1 Acc      | 46.1   | 257
Video Classification | Kinetics 400 (val)             | Top-1 Acc      | 77.7   | 204
Action Recognition   | Something-something v1 (test)  | Top-1 Accuracy | 46.1   | 189
Action Recognition   | Something-Something V1         | Top-1 Acc      | 46.1   | 162
Video Classification | Kinetics-400                   | Top-1 Acc      | 77.7   | 131
Video Classification | Something-something v1 (test)  | Top-1 Accuracy | 46.1   | 115
Video Classification | Something-something v1 (val)   | Top-1 Acc      | 46.1   | 75
Action Recognition   | Charades (val)                 | mAP            | 39.7   | 69
Action Recognition   | Charades                       | mAP            | 0.397  | 64

Showing 10 of 33 rows
