# Videos as Space-Time Region Graphs

## About
How do humans recognize the action "opening a book"? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this paper, we propose to represent videos as space-time region graphs which capture these two important cues. Our graph nodes are defined by object region proposals from different frames in a long-range video. These nodes are connected by two types of relations: (i) similarity relations capturing the long-range dependencies between correlated objects, and (ii) spatial-temporal relations capturing the interactions between nearby objects. We perform reasoning on this graph representation via Graph Convolutional Networks. We achieve state-of-the-art results on both the Charades and Something-Something datasets. On Charades in particular, we obtain a large 4.4% gain when our model is applied in complex environments.
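To make the reasoning step concrete, here is a minimal sketch of one graph-convolution update over region-proposal features, assuming (as the abstract describes) a similarity graph whose edge weights come from softmax-normalized pairwise feature affinities. The shapes, helper names, and random toy data are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for normalizing edge affinities.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gcn_layer(X, A, W):
    """One graph-convolution step: aggregate neighbor features with the
    adjacency A, project with weights W, then apply ReLU."""
    return np.maximum(A @ X @ W, 0.0)

# Toy setup: N region-proposal nodes, each with a d-dim appearance feature.
rng = np.random.default_rng(0)
N, d, d_out = 6, 8, 4
X = rng.standard_normal((N, d))

# Similarity graph: pairwise dot-product affinities, softmax-normalized
# per row so each node's incoming edge weights sum to 1.
A_sim = softmax(X @ X.T, axis=1)

W = rng.standard_normal((d, d_out))
Z = gcn_layer(X, A_sim, W)
print(Z.shape)  # (6, 4): updated per-node features
```

In the full model, a second adjacency built from spatial-temporal overlap between nearby proposals would be used alongside `A_sim`, and the updated node features would be pooled for action classification.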
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc 72.1 | 413 |
| Action Recognition | Something-something v1 (val) | Top-1 Acc 46.1 | 257 |
| Video Classification | Kinetics-400 (val) | Top-1 Acc 77.7 | 204 |
| Action Recognition | Something-something v1 (test) | Top-1 Acc 46.1 | 189 |
| Action Recognition | Something-something v1 | Top-1 Acc 46.1 | 162 |
| Video Classification | Kinetics-400 | Top-1 Acc 77.7 | 131 |
| Video Classification | Something-something v1 (test) | Top-1 Acc 46.1 | 115 |
| Video Classification | Something-something v1 (val) | Top-1 Acc 46.1 | 75 |
| Action Recognition | Charades (val) | mAP 39.7 | 69 |
| Action Recognition | Charades | mAP 39.7 | 64 |