Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection
About
Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that requires detecting all <human, verb, object> triplets of interest in an image, including those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which is at odds with the instance-level nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, which uses the LLM component of the VLM to provide fine-grained token-level supervision for the HOI detector. LSG in turn enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, and consistently achieve superior performance in both the open vocabulary and closed settings. The code will be released on GitHub.
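The abstract does not say how the attention bias enters the VLM. One common way to apply such a bias, assumed here purely for illustration and not confirmed as the BC-HOI mechanism, is to add it to the attention logits before the softmax, so that a strongly negative bias suppresses a visual token and a zero bias leaves it untouched. The function names, shapes, and toy values below are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention(query, keys, values, bias):
    """Single-query scaled dot-product attention with an additive bias.

    query:  list of d floats
    keys:   list of n key vectors, each of length d
    values: list of n value vectors
    bias:   list of n floats added to the logits before the softmax
            (hypothetical stand-in for a detector-provided attention bias)
    """
    d = len(query)
    logits = [
        sum(q * k for q, k in zip(query, key)) / math.sqrt(d) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim_v = len(values[0])
    output = [
        sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)
    ]
    return output, weights


# Toy example: three image patches with identical keys, so unbiased
# attention spreads evenly across them.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [1.0, 0.0], [1.0, 0.0]]
values = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]

_, uniform_weights = biased_attention(query, keys, values, [0.0, 0.0, 0.0])

# Assumed bias: the third patch lies outside the human-object region of
# interest, so it receives a large negative bias and drops out.
focused_output, focused_weights = biased_attention(
    query, keys, values, [0.0, 0.0, -1e9]
)
```

With the zero bias, each patch receives the same weight. With the large negative bias on the third patch, its weight collapses to zero and the output becomes the average of the first two value vectors, which is the sense in which a bias can steer a holistic feature toward a specific instance.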
Related benchmarks
| Task | Dataset | Metric | Value | Rank |
|---|---|---|---|---|
| Human-Object Interaction Detection | HICO-DET | mAP (Full) | 43.01 | 233 |
| Human-Object Interaction Detection | HICO-DET Known Object (test) | mAP (Full) | 45.35 | 112 |
| Human-Object Interaction Detection | HICO-DET (Rare First Unseen Combination (RF-UC)) | mAP (Full) | 40.99 | 77 |
| Human-Object Interaction Detection | V-COCO | AP^1 Role | 68.2 | 65 |
| Human-Object Interaction Detection | HICO-DET Non-rare First Unseen Composition (NF-UC) | AP (Unseen) | 33.01 | 49 |
| Human-Object Interaction Detection | HICO-DET (NF-UC) | mAP (Full) | 36.4 | 40 |
| Human-Object Interaction Detection | HICO-DET (UO) | mAP (Full) | 34.18 | 31 |
| Human-Object Interaction Detection | HICO-DET (UV) | mAP (Full) | 39.89 | 30 |
| Human-Object Interaction Detection | HOI Detection Dataset | Avg mAP | 37.87 | 6 |