Mining the Benefits of Two-stage and One-stage HOI Detection
About
Two-stage methods have dominated Human-Object Interaction (HOI) detection for several years, and one-stage methods have recently become popular. In this paper, we explore the essential pros and cons of both. We find that conventional two-stage methods mainly struggle to locate positive interactive human-object pairs, while one-stage methods struggle to strike an appropriate trade-off in multi-task learning, i.e., between object detection and interaction classification. A core problem is therefore how to keep the strengths and discard the weaknesses of the two conventional types of methods. To this end, we propose a novel one-stage framework that disentangles human-object detection and interaction classification in a cascade manner. In detail, we first design a human-object pair generator by removing the interaction classification module (head) from a state-of-the-art one-stage HOI detector, and then design a relatively isolated interaction classifier to classify each human-object pair. Each of the two cascaded decoders in our framework can thus focus on one specific task: detection or interaction classification. For the implementation, we adopt a transformer-based HOI detector as our base model. The newly introduced disentangling paradigm outperforms existing methods by a large margin, with a relative mAP gain of 9.32% on HICO-DET. The source code is available at https://github.com/YueLiao/CDN.
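The cascade described above can be illustrated with a minimal control-flow sketch. It is not the released CDN implementation: the names `Pair`, `HOI`, `cascade_hoi`, `toy_pairs`, and `toy_verbs` are hypothetical, the two stages are plain callables standing in for the two transformer decoders, and both the thresholds and the rule of multiplying the pair score by the verb score are assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass
class Pair:
    """Stage-1 output: a candidate human-object pair, with no verb attached."""
    human_box: Box
    object_box: Box
    object_label: str
    pair_score: float  # confidence that the pair is interactive (assumed)


@dataclass
class HOI:
    """Final output: a pair plus one predicted interaction."""
    pair: Pair
    verb: str
    score: float


def cascade_hoi(
    image,
    pair_generator: Callable[[object], List[Pair]],
    interaction_classifier: Callable[[object, Pair], Dict[str, float]],
    pair_thresh: float = 0.5,   # assumed value, not from the paper
    verb_thresh: float = 0.5,   # assumed value, not from the paper
) -> List[HOI]:
    """Run detection first, then classify interactions for each kept pair."""
    results: List[HOI] = []
    for pair in pair_generator(image):          # stage 1: detection only
        if pair.pair_score < pair_thresh:
            continue
        verbs = interaction_classifier(image, pair)  # stage 2: verbs only
        for verb, prob in verbs.items():
            if prob >= verb_thresh:
                # Assumption: combine the two stages by multiplying scores.
                results.append(HOI(pair, verb, pair.pair_score * prob))
    return sorted(results, key=lambda r: r.score, reverse=True)


# Toy stand-ins for the two stages (hypothetical data, for illustration).
def toy_pairs(image) -> List[Pair]:
    return [
        Pair((0, 0, 10, 20), (5, 5, 15, 15), "bicycle", 0.9),
        Pair((0, 0, 10, 20), (50, 50, 60, 60), "bench", 0.2),
    ]


def toy_verbs(image, pair: Pair) -> Dict[str, float]:
    if pair.object_label == "bicycle":
        return {"ride": 0.8, "hold": 0.3}
    return {"sit_on": 0.9}


if __name__ == "__main__":
    for hoi in cascade_hoi(None, toy_pairs, toy_verbs):
        print(hoi.pair.object_label, hoi.verb, round(hoi.score, 2))
```

The point of the structure is that the pair generator never sees verb labels and the interaction classifier never has to localize boxes, which is the disentangling the abstract argues for. In the actual model the second stage would presumably consume learned pair representations rather than boxes and labels.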
Related benchmarks
| Task | Dataset | Metric | Result | Rank |
|---|---|---|---|---|
| Human-Object Interaction Detection | HICO-DET (test) | mAP (full) | 32.1 | 493 |
| Human-Object Interaction Detection | V-COCO (test) | AP (Role, Scenario 1) | 63.91 | 270 |
| Human-Object Interaction Detection | HICO-DET | mAP (Full) | 34.53 | 233 |
| Human-Object Interaction Detection | HICO-DET Known Object (test) | mAP (Full) | 34.79 | 112 |
| Human-Object Interaction Detection | V-COCO 1.0 (test) | AP_role (#1) | 63.91 | 76 |
| Human-Object Interaction Detection | V-COCO | AP^1 Role | 62.3 | 65 |
| HOI Detection | V-COCO | AP Role 1 | 63.9 | 40 |
| HOI Detection | HICO-DET | mAP (Rare) | 27.39 | 34 |
| Human-Object Interaction Detection | HICO-DET 1 (test) | Full mAP | 34.53 | 33 |
| Human-Object Interaction Detection | V-COCO | Box mAP (Scenario 1) | 63.9 | 32 |