Neural-Logic Human-Object Interaction Detection
About
The interaction decoder used in prevalent Transformer-based HOI detectors typically accepts pre-composed human-object pairs as inputs. Though it achieves remarkable performance, this paradigm lacks feasibility and cannot explore novel combinations over entities during decoding. We present LogicHOI, a new HOI detector that leverages neural-logic reasoning and a Transformer to infer feasible interactions between entities. Specifically, we modify the self-attention mechanism in the vanilla Transformer, enabling it to reason over the <human, action, object> triplet and constitute novel interactions. This reasoning process is guided by two crucial properties for understanding HOI: affordances (the potential actions an object can facilitate) and proxemics (the spatial relations between humans and objects). We formulate these two properties in first-order logic and ground them into continuous space to constrain the learning process of our approach, leading to improved performance and zero-shot generalization capabilities. We evaluate LogicHOI on V-COCO and HICO-DET under both normal and zero-shot setups, achieving significant improvements over existing methods.
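To make the idea of grounding a first-order rule into continuous space concrete, here is a minimal sketch of a fuzzy-logic relaxation of an affordance rule of the form "for every action a, if a is predicted then the object affords a". The abstract does not say which relaxation LogicHOI uses, so the Reichenbach implication, the mean as a soft universal quantifier, and every name below (`implies`, `forall`, `affordance_loss`, the example probabilities) are assumptions made for illustration only.

```python
def implies(a, b):
    """Reichenbach fuzzy implication: truth of (a -> b) for a, b in [0, 1].
    Assumed relaxation; the source does not specify one."""
    return 1.0 - a + a * b


def forall(values):
    """Soft universal quantifier, relaxed here as the mean truth value."""
    values = list(values)
    return sum(values) / len(values)


def affordance_loss(action_probs, afforded):
    """Penalty for violating: for every action a, predicted(a) -> afforded(a).

    action_probs: dict mapping action name to predicted probability.
    afforded: dict mapping action name to 1.0 if the object affords it, else 0.0.
    """
    truth = forall(implies(p, afforded[a]) for a, p in action_probs.items())
    return 1.0 - truth


# Hypothetical predictions for a human paired with a bicycle.
probs = {"ride": 0.9, "eat": 0.8}
bicycle = {"ride": 1.0, "eat": 0.0}
loss = affordance_loss(probs, bicycle)  # about 0.4, driven by the "eat" prediction
```

Because the penalty is built only from sums and products of the predicted probabilities, it is differentiable with respect to them, so a term of this kind could be added to a training objective as a soft constraint. A proxemics rule could be relaxed in the same way, with spatial predicates in place of the affordance table.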
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Human-Object Interaction Detection | HICO-DET (test) | mAP (Full) | 35.47 | 493 |
| Human-Object Interaction Detection | V-COCO (test) | AP (Role, Scenario 1) | 64.4 | 270 |
| Human-Object Interaction Detection | HICO-DET | mAP (Full) | 35.47 | 233 |
| Human-Object Interaction Detection | HICO-DET (Rare First Unseen Combination (RF-UC)) | mAP (Full) | 33.17 | 77 |
| Human-Object Interaction Detection | HICO-DET Non-rare First Unseen Composition (NF-UC) | AP (Unseen) | 26.84 | 49 |
| Human-Object Interaction Detection | HICO-DET (NF-UC) | mAP (Full) | 27.95 | 40 |
| Human-Object Interaction Detection | HICO-DET (UO) | mAP (Full) | 28.23 | 31 |
| Human-Object Interaction Detection | V-COCO | AP (Role) | 65.6 | 23 |