Learning Human-Object Interactions by Graph Parsing Neural Networks
About
This paper addresses the task of detecting and recognizing human-object interactions (HOI) in images and videos. We introduce the Graph Parsing Neural Network (GPNN), a framework that incorporates structural knowledge while being differentiable end-to-end. For a given scene, GPNN infers a parse graph that includes i) the HOI graph structure, represented by an adjacency matrix, and ii) the node labels. Within a message passing inference framework, GPNN iteratively computes the adjacency matrices and node labels. We extensively evaluate our model on three HOI detection benchmarks on images and videos: the HICO-DET, V-COCO, and CAD-120 datasets. Our approach significantly outperforms state-of-the-art methods, verifying that GPNN is scalable to large datasets and applies to spatial-temporal settings. The code is available at https://github.com/SiyuanQi/gpnn.
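The iterative inference described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation (which uses learned link and message functions with GRU-style updates); the weights, shapes, and `tanh` update below are stand-in assumptions showing the alternation between estimating a soft adjacency matrix and passing messages over it:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gpnn_step(h, edge_feats, w_link, W_msg):
    """One message-passing iteration (toy version): infer a soft
    adjacency matrix from pairwise edge features, then propagate
    messages over it to update the node states."""
    A = sigmoid(edge_feats @ w_link)        # (n, n) soft adjacency
    messages = A @ (h @ W_msg)              # aggregate weighted neighbor messages
    h_new = np.tanh(h + messages)           # simple state update (GRU in the paper)
    return h_new, A

# Toy scene: 4 nodes (e.g. 1 human + 3 objects), 8-dim features.
rng = np.random.default_rng(0)
n, d = 4, 8
h = rng.normal(size=(n, d))                 # node features
edge_feats = rng.normal(size=(n, n, d))     # pairwise edge features
w_link = rng.normal(size=d)                 # stand-in link-function weights
W_msg = rng.normal(scale=0.1, size=(d, d))  # stand-in message-function weights

for _ in range(3):                          # iterative parse-graph inference
    h, A = gpnn_step(h, edge_feats, w_link, W_msg)

print(A.shape, h.shape)                     # adjacency and updated node states
```

In the full model, node labels (HOI classes) would be read out from the final node states `h`, and all functions are trained end-to-end by backpropagating through the message passing steps.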
Related benchmarks
| Task | Dataset | Metric | Score | Rank |
|---|---|---|---|---|
| Human-Object Interaction Detection | HICO-DET (test) | mAP (Full) | 13.11 | 493 |
| Human-Object Interaction Detection | V-COCO (test) | AP (Role, Scenario 1) | 44 | 270 |
| Human-Object Interaction Detection | HICO-DET | mAP (Full) | 13.11 | 233 |
| Human-Object Interaction Detection | V-COCO 1.0 (test) | AP_role (#1) | 44 | 76 |
| HOI Detection | HICO-DET (test) | Box mAP (Full) | 13.11 | 32 |
| Human-Object Interaction Detection | V-COCO | Box mAP (Scenario 1) | 44 | 32 |
| HOI Detection | VidHOI (val) | mAP (Full) | 18.47 | 23 |
| Human-Object Interaction Detection | V-COCO | AP (Role) | 44 | 23 |
| Human-Object Interaction Detection | HICO-DET 9 (test) | mAP (Full) | 13.11 | 21 |
| Human-Object Interaction Detection | V-COCO standard (test) | AP (Role 1) | 44 | 18 |