# Videos as Space-Time Region Graphs

## About
How do humans recognize the action "opening a book"? We argue that there are two important cues: modeling temporal shape dynamics and modeling functional relationships between humans and objects. In this paper, we propose to represent videos as space-time region graphs which capture these two important cues. Our graph nodes are defined by object region proposals from different frames in a long-range video. These nodes are connected by two types of relations: (i) similarity relations capturing the long-range dependencies between correlated objects, and (ii) spatial-temporal relations capturing the interactions between nearby objects. We perform reasoning on this graph representation via Graph Convolutional Networks. We achieve state-of-the-art results on both the Charades and Something-Something datasets. On Charades in particular, we obtain a large 4.4% gain when our model is applied in complex environments.
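To make the reasoning step concrete, here is a minimal sketch of one graph-convolution update over region-proposal features, assuming (as the abstract describes) a similarity graph whose edge weights come from softmax-normalized pairwise feature affinities. The shapes, helper names, and random toy data are illustrative, not the paper's actual implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax for normalizing edge affinities.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def gcn_layer(X, A, W):
    """One graph-convolution step: aggregate neighbor features with the
    adjacency A, project with weights W, then apply ReLU."""
    return np.maximum(A @ X @ W, 0.0)

# Toy setup: N region-proposal nodes, each with a d-dim appearance feature.
rng = np.random.default_rng(0)
N, d, d_out = 6, 8, 4
X = rng.standard_normal((N, d))

# Similarity graph: pairwise dot-product affinities, softmax-normalized
# per row so each node's incoming edge weights sum to 1.
A_sim = softmax(X @ X.T, axis=1)

W = rng.standard_normal((d, d_out))
Z = gcn_layer(X, A_sim, W)
print(Z.shape)  # (6, 4): updated per-node features
```

In the full model, a second adjacency built from spatial-temporal overlap between nearby proposals would be used alongside `A_sim`, and the updated node features would be pooled for action classification.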
## Related benchmarks
| Task | Dataset | Result | Rank |
|---|---|---|---|
| Action Recognition | Kinetics-400 | Top-1 Acc 72.1 | 413 |
| Action Recognition | Something-something v1 (val) | Top-1 Acc 46.1 | 257 |
| Video Classification | Kinetics-400 (val) | Top-1 Acc 77.7 | 204 |
| Action Recognition | Something-something v1 (test) | Top-1 Acc 46.1 | 189 |
| Action Recognition | Something-something v1 | Top-1 Acc 46.1 | 162 |
| Video Classification | Kinetics-400 | Top-1 Acc 77.7 | 131 |
| Video Classification | Something-something v1 (test) | Top-1 Acc 46.1 | 115 |
| Video Classification | Something-something v1 (val) | Top-1 Acc 46.1 | 75 |
| Action Recognition | Charades (val) | mAP 39.7 | 69 |
| Action Recognition | Charades | mAP 39.7 | 64 |