
Zero-shot HOI Detection with MLLM-based Detector-agnostic Interaction Recognition

About

Zero-shot human-object interaction (HOI) detection aims to locate humans and objects in images and recognize their interactions. While advances in open-vocabulary object detection provide promising solutions for object localization, interaction recognition (IR) remains challenging due to the combinatorial diversity of interactions. Existing methods, including two-stage methods, tightly couple IR with a specific detector and rely on coarse-grained vision-language model (VLM) features, which limits generalization to unseen interactions. In this work, we propose a decoupled framework that separates object detection from IR and leverages multi-modal large language models (MLLMs) for zero-shot IR. We introduce a deterministic generation method that formulates IR as a visual question answering task and enforces deterministic outputs, enabling training-free zero-shot IR. To further enhance performance and efficiency by fine-tuning the model, we design a spatial-aware pooling module that integrates appearance and pairwise spatial cues, and a one-pass deterministic matching method that predicts all candidate interactions in a single forward pass. Extensive experiments on HICO-DET and V-COCO demonstrate that our method achieves superior zero-shot performance, strong cross-dataset generalization, and the flexibility to integrate with any object detector without retraining. The code is publicly available at https://github.com/SY-Xuan/DA-HOI.
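
The decoupling described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Detection` type, the `recognize` function, the threshold, and the `toy_scorer` stub are all hypothetical names introduced here. The stub stands in for the MLLM-based recognizer, and the only point is that the recognition step consumes plain detector outputs (labels and boxes), so any detector could supply them.

```python
from dataclasses import dataclass
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


@dataclass(frozen=True)
class Detection:
    """Generic detector output; nothing here depends on a specific detector."""
    label: str
    box: Box
    score: float


# Hypothetical scorer interface: returns one score per candidate verb.
# In the paper this role is played by an MLLM; here it is an assumption.
Scorer = Callable[[Detection, Detection, Sequence[str]], Dict[str, float]]


def pair_humans_objects(dets: Sequence[Detection]) -> List[Tuple[Detection, Detection]]:
    """Pair every detected person with every other detection."""
    humans = [d for d in dets if d.label == "person"]
    return [(h, o) for h, o in product(humans, dets) if h is not o]


def recognize(
    dets: Sequence[Detection],
    verbs: Sequence[str],
    scorer: Scorer,
    threshold: float = 0.5,  # illustrative value, not from the paper
) -> List[Tuple[Detection, str, Detection, float]]:
    """Score a fixed candidate verb list per pair; no sampling, so output is deterministic."""
    results = []
    for human, obj in pair_humans_objects(dets):
        scores = scorer(human, obj, verbs)
        for verb in verbs:
            if scores[verb] >= threshold:
                results.append((human, verb, obj, scores[verb]))
    return results


def boxes_overlap(a: Box, b: Box) -> bool:
    """True if two axis-aligned boxes intersect with positive area."""
    return min(a[2], b[2]) > max(a[0], b[0]) and min(a[3], b[3]) > max(a[1], b[1])


def toy_scorer(human: Detection, obj: Detection, verbs: Sequence[str]) -> Dict[str, float]:
    """Stand-in for the MLLM: favours 'ride' when the boxes overlap (purely illustrative)."""
    overlap = boxes_overlap(human.box, obj.box)
    return {v: (0.9 if v == "ride" and overlap else 0.1) for v in verbs}


detections = [
    Detection("person", (0.0, 0.0, 10.0, 10.0), 0.9),
    Detection("bicycle", (5.0, 5.0, 15.0, 15.0), 0.8),
]
triplets = recognize(detections, ["ride", "hold"], toy_scorer)
```

Because `recognize` depends only on the `Detection` records and the `Scorer` callable, swapping the detector or the recognizer changes nothing else in the pipeline, which is the property the abstract calls detector-agnostic.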

Shiyu Xuan, Dongkai Wang, Zechao Li, Jinhui Tang • 2026

Related benchmarks

Task | Dataset | Result | Rank
Human-Object Interaction Detection | HICO-DET | mAP (Full) 46.21 | 233
Human-Object Interaction Detection | HICO-DET (Rare First Unseen Combination (RF-UC)) | mAP (Full) 44.81 | 77
Human-Object Interaction Detection | V-COCO | -- | 65
Human-Object Interaction Detection | HICO-DET (NF-UC) | mAP (Full) 42.01 | 40
Human-Object Interaction Detection | HICO-DET (UO) | mAP (Full) 45.28 | 31
Human-Object Interaction Detection | HICO-DET (UV) | mAP (Full) 44.43 | 30
Human-Object Interaction Detection | V-COCO v2 (test) | mAP (Role) 59.91 | 7
Human-Object Interaction Detection | HOI Detection Dataset | Avg mAP 44 | 6