Bilateral Collaboration with Large Vision-Language Models for Open Vocabulary Human-Object Interaction Detection

About

Open vocabulary Human-Object Interaction (HOI) detection is a challenging task that detects all <human, verb, object> triplets of interest in an image, even those that are not pre-defined in the training set. Existing approaches typically rely on output features generated by large Vision-Language Models (VLMs) to enhance the generalization ability of interaction representations. However, the visual features produced by VLMs are holistic and coarse-grained, which contradicts the nature of detection tasks. To address this issue, we propose a novel Bilateral Collaboration framework for open vocabulary HOI detection (BC-HOI). This framework includes an Attention Bias Guidance (ABG) component, which guides the VLM to produce fine-grained instance-level interaction features according to the attention bias provided by the HOI detector. It also includes a Large Language Model (LLM)-based Supervision Guidance (LSG) component, in which the LLM component of the VLM provides fine-grained token-level supervision for the HOI detector. LSG enhances the ability of ABG to generate high-quality attention bias. We conduct extensive experiments on two popular benchmarks, HICO-DET and V-COCO, consistently achieving superior performance in both the open vocabulary and closed settings. The code will be released on GitHub.
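
The abstract does not say how the attention bias is computed or where it enters the VLM. As a rough illustration of the general idea only, the sketch below assumes the bias is added to the attention logits before the softmax, and that it is derived from a detector box over a patch grid. The helper names (`box_bias`, `biased_attention`) and the bias values are hypothetical and are not taken from the paper.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def box_bias(grid_w, grid_h, box, inside=0.0, outside=-4.0):
    """Hypothetical bias: `inside` for patches whose centre falls in the box,
    `outside` elsewhere. `box` is (x0, y0, x1, y1) in normalized [0, 1] coordinates.
    Patches are listed row by row."""
    x0, y0, x1, y1 = box
    bias = []
    for row in range(grid_h):
        for col in range(grid_w):
            cx = (col + 0.5) / grid_w
            cy = (row + 0.5) / grid_h
            in_box = x0 <= cx <= x1 and y0 <= cy <= y1
            bias.append(inside if in_box else outside)
    return bias


def biased_attention(query, keys, values, bias):
    """Single-query scaled dot-product attention with an additive bias on the logits.
    Assumption: the bias is simply added before the softmax."""
    scale = 1.0 / math.sqrt(len(query))
    logits = [
        scale * sum(q * k for q, k in zip(query, key)) + b
        for key, b in zip(keys, bias)
    ]
    weights = softmax(logits)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights


if __name__ == "__main__":
    # 2x2 patch grid; the box covers only the top-left patch.
    bias = box_bias(2, 2, (0.0, 0.0, 0.5, 0.5))
    keys = [[0.0, 0.0]] * 4          # identical keys, so only the bias matters
    values = [[1.0], [2.0], [3.0], [4.0]]
    out, weights = biased_attention([1.0, 0.0], keys, values, bias)
    print(weights)  # most of the attention mass lands on the first patch
```

With identical keys, the unbiased attention would be uniform over the four patches; the additive bias concentrates it on the patch inside the box, which is the kind of instance-level focusing the abstract attributes to ABG.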

Yupeng Hu, Changxing Ding, Chang Sun, Shaoli Huang, Xiangmin Xu • 2025

Related benchmarks

Task	Dataset	Result
Human-Object Interaction Detection	HICO-DET	mAP (Full) 43.01	263
Human-Object Interaction Detection	HICO-DET Known Object (test)	mAP (Full) 45.35	118
Human-Object Interaction Detection	HICO-DET (Rare First Unseen Combination (RF-UC))	mAP (Full) 40.99	77
Human-Object Interaction Detection	V-COCO	AP^1 Role 68.2	65
Human-Object Interaction Detection	HICO-DET (NF-UC)	mAP (Full) 36.4	56
Human-Object Interaction Detection	HICO-DET Non-rare First Unseen Composition (NF-UC)	AP (Unseen) 33.01	49
Human-Object Interaction Detection	HICO-DET (UO)	mAP (Full) 40.99	47
Human-Object Interaction Detection	HICO-DET (UV)	mAP (Full) 39.89	30
Human-Object Interaction Detection	HICO-DET closed setting	Performance Score (Rare) 45.76	18
Human-Object Interaction Detection	HOI Detection Dataset	Avg mAP 37.87	6
