
End-to-End Zero-Shot HOI Detection via Vision and Language Knowledge Distillation

About

Most existing Human-Object Interaction (HOI) detection methods rely heavily on full annotations with predefined HOI categories, which are limited in diversity and costly to scale further. We aim to advance zero-shot HOI detection so that both seen and unseen HOIs are detected simultaneously. The fundamental challenges are to discover potential human-object pairs and to identify novel HOI categories. To overcome these challenges, we propose a novel end-to-end zero-shot HOI Detection (EoID) framework via vision-language knowledge distillation. We first design an Interactive Score module combined with a Two-stage Bipartite Matching algorithm to distinguish interactive human-object pairs in an action-agnostic manner. We then transfer the distribution of action probability from the pretrained vision-language teacher, as well as the seen ground truth, to the HOI model to attain zero-shot HOI classification. Extensive experiments on the HICO-Det dataset demonstrate that our model discovers potential interactive pairs and enables the recognition of unseen HOIs. Finally, our method outperforms the previous SOTA by 8.92% on unseen mAP and 10.18% on overall mAP under the UA setting, and by 6.02% on unseen mAP and 9.1% on overall mAP under the UC setting. Moreover, our method is generalizable to large-scale object detection data to further scale up the action sets. The source code will be available at: https://github.com/mrwu-mac/EoID.
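To make the distillation idea concrete, the following is a minimal sketch of how a student's action scores might be trained against both seen ground truth and a teacher's soft action probabilities. It is not taken from the EoID code: the function name `distillation_loss`, the `seen_mask` and `alpha` parameters, and the choice of binary cross-entropy for both terms are assumptions made for illustration, and the paper's actual loss may differ.

```python
import math


def sigmoid(x):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def bce(p, t, eps=1e-7):
    """Binary cross-entropy between a predicted probability p and a target t in [0, 1]."""
    p = min(max(p, eps), 1.0 - eps)
    return -(t * math.log(p) + (1.0 - t) * math.log(1.0 - p))


def distillation_loss(student_logits, teacher_probs, gt_labels, seen_mask, alpha=0.5):
    """Hypothetical blend of a supervised term and a soft-target term.

    student_logits: per-action logits from the HOI model (student).
    teacher_probs:  per-action probabilities from a vision-language teacher.
    gt_labels:      0/1 ground-truth labels; only meaningful where seen_mask is 1.
    seen_mask:      1 for actions annotated during training, 0 for unseen ones.
    alpha:          assumed weight between the two terms.
    """
    probs = [sigmoid(z) for z in student_logits]

    # Supervised term: only seen actions have ground truth.
    n_seen = sum(seen_mask)
    supervised = sum(
        bce(p, y) for p, y, m in zip(probs, gt_labels, seen_mask) if m
    ) / max(n_seen, 1)

    # Distillation term: the teacher provides soft targets for every action,
    # including unseen ones, which is what allows zero-shot classification.
    distill = sum(bce(p, q) for p, q in zip(probs, teacher_probs)) / len(probs)

    return alpha * supervised + (1.0 - alpha) * distill


# Third action is unseen: it receives a training signal only from the teacher.
teacher = [0.9, 0.1, 0.9]
labels = [1, 0, 0]
seen = [1, 1, 0]
agree = distillation_loss([4.0, -4.0, 4.0], teacher, labels, seen)
disagree = distillation_loss([-4.0, 4.0, -4.0], teacher, labels, seen)
```

In this toy setup, a student that agrees with both the labels and the teacher gets a lower loss than one that contradicts them, even though the third action never appears in the ground truth.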

Mingrui Wu, Jiaxin Gu, Yunhang Shen, Mingbao Lin, Chao Chen, Xiaoshuai Sun • 2022

Related benchmarks

Task | Dataset | Result | Rank
Human-Object Interaction Detection | HICO-DET (Rare First Unseen Combination (RF-UC)) | mAP (Full) 29.52 | 77
Human-Object Interaction Detection | HICO-DET Non-rare First Unseen Composition (NF-UC) | AP (Unseen) 26.77 | 49
Human-Object Interaction Detection | HICO-DET (NF-UC) | mAP (Full) 28.91 | 40
Human-Object Interaction Detection | HICO-DET (UV) | mAP (Full) 29.61 | 30
Human-Object Interaction Detection | HICO-DET Unseen Verb (UV) | Unseen Score 22.71 | 11