HOICLIP: Efficient Knowledge Transfer for HOI Detection with Vision-Language Models

About

Human-Object Interaction (HOI) detection aims to localize human-object pairs and recognize their interactions. Recently, Contrastive Language-Image Pre-training (CLIP) has shown great potential in providing interaction priors for HOI detectors via knowledge distillation. However, such approaches often rely on large-scale training data and suffer from inferior performance under few/zero-shot scenarios. In this paper, we propose a novel HOI detection framework that efficiently extracts prior knowledge from CLIP and achieves better generalization. In detail, we first introduce a novel interaction decoder to extract informative regions in the visual feature map of CLIP via a cross-attention mechanism, which is then fused with the detection backbone by a knowledge integration block for more accurate human-object pair detection. In addition, prior knowledge in the CLIP text encoder is leveraged to generate a classifier by embedding HOI descriptions. To distinguish fine-grained interactions, we build a verb classifier from training data via visual semantic arithmetic and a lightweight verb representation adapter. Furthermore, we propose a training-free enhancement to exploit global HOI predictions from CLIP. Extensive experiments demonstrate that our method outperforms the state of the art by a large margin in various settings, e.g., +4.04 mAP on HICO-Det. The source code is available at https://github.com/Artanic30/HOICLIP.

Shan Ning, Longtian Qiu, Yongfei Liu, Xuming He • 2023
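
The abstract mentions generating a classifier by embedding HOI descriptions with the CLIP text encoder. A minimal sketch of that general idea follows: score a visual feature against per-class text embeddings by cosine similarity. It is not the authors' implementation. The three-dimensional vectors are toy stand-ins for real CLIP embeddings, and the names `cosine_logits`, `hoi_labels`, `interaction_feature`, and the `scale` value are assumptions made for illustration.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def cosine_logits(visual_feature, class_embeddings, scale=1.0):
    """Cosine similarity between one visual feature and each class embedding."""
    v = l2_normalize(visual_feature)
    return [
        scale * sum(a * b for a, b in zip(v, l2_normalize(w)))
        for w in class_embeddings
    ]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical HOI descriptions; in practice each would be embedded by a text encoder.
hoi_labels = [
    "a person riding a bicycle",
    "a person holding a cup",
    "a person feeding a horse",
]
# Toy stand-ins for text embeddings (assumption: real embeddings are much larger).
text_embeddings = [[0.9, 0.1, 0.0], [0.1, 0.9, 0.1], [0.0, 0.2, 0.9]]
# Toy stand-in for the visual feature of one detected human-object pair.
interaction_feature = [0.8, 0.2, 0.1]

logits = cosine_logits(interaction_feature, text_embeddings, scale=10.0)
probs = softmax(logits)
predicted = hoi_labels[probs.index(max(probs))]
print(predicted)
```

Because the classifier weights come from text rather than from labeled examples, an unseen HOI category can in principle be added by appending its description to `hoi_labels` and its embedding to `text_embeddings`, which is the property that makes this kind of classifier attractive in zero-shot settings.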

Related benchmarks

Task	Dataset	Result
Human-Object Interaction Detection	HICO-DET (test)	mAP (full) 34.69	544
Human-Object Interaction Detection	V-COCO (test)	AP (Role, Scenario 1) 63.5	270
Human-Object Interaction Detection	HICO-DET	mAP (Full) 34.69	263
Human-Object Interaction Detection	HICO-DET Known Object (test)	mAP (Full) 37.96	118
Human-Object Interaction Detection	HICO-DET (Rare First Unseen Combination (RF-UC))	mAP (Full) 32.99	77
Human-Object Interaction Detection	V-COCO 1.0 (test)	AP_role (#1) 63.5	76
Human-Object Interaction Detection	V-COCO	AP^1 Role 63.5	65
Human-Object Interaction Detection	HICO-DET (NF-UC)	mAP (Full) 29.93	56
Human-Object Interaction Detection	V-COCO	AP Role (Scenario 1) 63.5	53
Human-Object Interaction Detection	HICO-DET Non-rare First Unseen Composition (NF-UC)	AP (Unseen) 26.39	49

